600.465 Introduction to NLP (Fall 1999)
Final Exam
Date: Dec 21 9am (90 min.)
SSN:__________________________________
Name:_________________________________
If asked to compute something for which you have the numbers,
that really means computing the final number, not just writing down the formula.[1]
If asked for a formula, write down the formula.
1. The Basics: Probability, Smoothing, Entropy, and Cross-Entropy
Suppose your training data contains (only) the sentence (http://www.petpainter.com,
slightly adapted)
We specialize in petbirds including but not limited
to dogs , cats , and horses .
You are building a bigram language model p(w_i|w_{i-1})
using maximum likelihood estimation. You do not add any special
symbols/words to the beginning or end of the data (i.e., you use a separate
unigram distribution for the first position of any given data).
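(Note: for self-checking only, here is a minimal Python sketch of the MLE bigram estimates from this training sentence. Whitespace tokenization and the helper names are assumptions, not part of the exam.)

    from collections import Counter

    # Training data, tokenized by whitespace (an assumption).
    train = ("We specialize in petbirds including but not limited "
             "to dogs , cats , and horses .").split()

    bigrams = Counter(zip(train, train[1:]))   # c(w_{i-1}, w_i)
    histories = Counter(train[:-1])            # c(w_{i-1}) as a history

    def p(w, h):
        """Raw MLE bigram probability p(w | h)."""
        return bigrams[(h, w)] / histories[h]

    print(p("cats", ","))                      # the first question below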
-
What is the (raw) probability p(cats|,)?
________________________________
-
What is the definition of the training data probability,
using the bigram approximation?
________________________________________________________
-
What are the entropy and the perplexity of the raw bigram distribution?
___________________________
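(Note: one common reading of this question is the per-word entropy of the model on its own training data, scoring the first position with the unigram distribution as the instructions above direct; that reading is an assumption. A sketch continuing the previous one, reusing train and p:)

    from math import log2

    N = len(train)
    p_uni = Counter(train)                     # unigram counts for position 1
    H = -(log2(p_uni[train[0]] / N)
          + sum(log2(p(w, h)) for h, w in zip(train, train[1:]))) / N
    print("entropy:", H, "perplexity:", 2 ** H)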
Now let's turn to the issues of smoothing and cross-entropy.
Suppose this is your test data (http://hometown.aol.com/~arkful):
We also have gift baskets for dogs , cats , and horses
and their owners .
-
What is the cross-entropy of the raw distribution extracted
from the above training data when applied to this test data?
__________________________________________________
-
What is the cross-entropy on the same test data when you
use the linearly smoothed distribution
p_µ(w_i|w_{i-1}) = µ p(w_i|w_{i-1}) + (1 - µ)/32?
Suppose the parameter µ has been set to 0.5 on some heldout data,
and that whenever p(w_i|w_{i-1}) is undefined on the training data
(either because you have not seen the history w_{i-1}, or because you
are dealing with the first position of the test data), you take
p(w_i|w_{i-1}) = 1/32.
__________________________________________________
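(Note: a sketch of the smoothed cross-entropy computation, under the same tokenization assumption as above; p_raw and p_smooth are hypothetical helper names.)

    from collections import Counter
    from math import log2

    train = ("We specialize in petbirds including but not limited "
             "to dogs , cats , and horses .").split()
    test = ("We also have gift baskets for dogs , cats , and horses "
            "and their owners .").split()

    bigrams = Counter(zip(train, train[1:]))
    histories = Counter(train[:-1])

    def p_raw(w, h):
        """MLE bigram probability; None when undefined (unseen history,
        or the first test position, encoded as h = None)."""
        if h is None or histories[h] == 0:
            return None
        return bigrams[(h, w)] / histories[h]

    def p_smooth(w, h, mu=0.5, V=32):
        raw = p_raw(w, h)
        if raw is None:                        # undefined -> 1/V, per the exam
            raw = 1 / V
        return mu * raw + (1 - mu) / V

    # per-word cross-entropy in bits
    H = -sum(log2(p_smooth(w, h))
             for h, w in zip([None] + test[:-1], test)) / len(test)
    print(H)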
2. Maximum Entropy
-
What is the general form of the so-called Maximum Entropy (ME)
model?
p(y|x) = ____________________________________________
-
Let's assume S = {f_1, f_2, ..., f_n}
are all the binary-valued (0/1) features you are supposed to use (and find
the weights of). What is the general form of the constraints put
on the model?
_______________________________________
-
Suppose you use only two binary-valued (0/1) features, f
and g, defined as follows:
f(y,x) = 1 iff y = "," or y = "and", and there is no other comma to the
left of y (within the same sentence);
g(y,x) = 1 iff y = "and" and there is a comma immediately in front of it.
Write down the constraints of the corresponding ME model, using the training
data from part 1 (see the checking sketch after the answer blanks):
We specialize in petbirds including but
not limited to dogs , cats , and horses .
__________________________________________________________
__________________________________________________________
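(Note: to check the empirical feature expectations that enter the constraints, a small sketch; the tokenization and function names are assumptions.)

    tokens = ("We specialize in petbirds including but not limited "
              "to dogs , cats , and horses .").split()

    def f(i, toks):
        # y = "," or "and", with no other comma to the left of position i
        return toks[i] in {",", "and"} and "," not in toks[:i]

    def g(i, toks):
        # y = "and" with a comma immediately in front of it
        return toks[i] == "and" and i > 0 and toks[i - 1] == ","

    N = len(tokens)
    print("E~[f] =", sum(f(i, tokens) for i in range(N)), "/", N)
    print("E~[g] =", sum(g(i, tokens) for i in range(N)), "/", N)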
-
What can you say about the weights associated with these
two features after performing the maximum entropy training? (If possible,
compute their values and write them down, too.)
lambda_f = __________________________________________________________
lambda_g = __________________________________________________________
3. Tagging
-
Describe three different approaches to tagging. Include formulas
if/where appropriate.
1. ____________________________________________________________________________
______________________________________________________________________________
______________________________________________________________________________
______________________________________________________________________________
______________________________________________________________________________
______________________________________________________________________________
______________________________________________________________________________
2. ____________________________________________________________________________
______________________________________________________________________________
______________________________________________________________________________
______________________________________________________________________________
______________________________________________________________________________
______________________________________________________________________________
______________________________________________________________________________
3. ____________________________________________________________________________
______________________________________________________________________________
______________________________________________________________________________
______________________________________________________________________________
______________________________________________________________________________
______________________________________________________________________________
______________________________________________________________________________
4. Shift-Reduce Parsing
Let's assume the following context-free grammar G,
with seven numbered rules:
#1 S -> A R
#2 A -> A B
#3 A -> a
#4 B -> b
#5 R -> B R C
#6 R -> r
#7 C -> c
-
Decide which of the following strings belong to the language
defined by the grammar G:
aaaabr    Yes/No: ____________
abbbrcc   Yes/No: ____________
ar        Yes/No: ____________
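(Note: you can sanity-check membership mechanically. Below is a brute-force derivability sketch, not a shift-reduce parser, assuming the grammar exactly as numbered above; all names are illustrative.)

    # Grammar G; lowercase symbols are terminals.
    RULES = [
        ("S", ("A", "R")), ("A", ("A", "B")), ("A", ("a",)),
        ("B", ("b",)), ("R", ("B", "R", "C")), ("R", ("r",)), ("C", ("c",)),
    ]

    def derives(sym, s):
        """True iff `sym` derives the string `s`."""
        if sym.islower():                      # terminal: must match exactly
            return s == sym
        return any(seq_derives(rhs, s) for lhs, rhs in RULES if lhs == sym)

    def seq_derives(seq, s):
        """True iff the symbol sequence `seq` derives `s` (G has no epsilon
        rules, so every symbol must cover at least one character)."""
        if len(seq) == 1:
            return derives(seq[0], s)
        head, rest = seq[0], seq[1:]
        return any(derives(head, s[:i]) and seq_derives(rest, s[i:])
                   for i in range(1, len(s) - len(rest) + 1))

    for w in ["aaaabr", "abbbrcc", "ar"]:
        print(w, derives("S", w))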
-
Create the shift-reduce parse tables for this grammar.
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
-
Show how the shift-reduce parser parses the string
abbbrc:
Input stack | State stack | Rule (Reduction) | Backtracking stack
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
5. Statistical Parsing
-
What is a Probabilistic CFG? Use five sentences at most.
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
Now suppose the following "treebank" data, where the
pair of symbols ( ... )XX denotes a (sub)tree rooted in a nonterminal XX.
Sentence 1:
(((the)DET (dog)N)NP ((saw)V (((the)DET (cat)N)NP
((with)PREP (((a)DET (telescope)N)NP ((with)PREP ((a)DET (mirror)N)NP)PP)NP)PP)M2)VP)S
Sentence 2:
(((the)DET (dog)N)NP ((saw)V (((the)DET (cat)N)NP
((with)PREP ((a)DET (mirror)N)NP)PP )NP)VP)S
-
Draw the usual parse trees, with terminal symbols (words)
as leaves, and other nodes labeled by nonterminals:
the dog saw the cat with a telescope with a mirror
the dog saw the cat with a mirror
-
Write down the (general, non-lexicalized) CFG grammar as
extracted from this training data (see the checking sketch after the rule list):
#01 ______________________________
#02 ______________________________
#03 ______________________________
#04 ______________________________
#05 ______________________________
#06 ______________________________
#07 ______________________________
#08 ______________________________
#09 ______________________________
#10 ______________________________
#11 ______________________________
#12 ______________________________
#13 ______________________________
#14 ______________________________
#15 ______________________________
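(Note: to check your extracted grammar, a sketch that reads the rules off the bracketed trees; the parser for the ( ... )XX notation and all names are assumptions.)

    import re
    from collections import Counter

    SENT1 = ("(((the)DET (dog)N)NP ((saw)V (((the)DET (cat)N)NP "
             "((with)PREP (((a)DET (telescope)N)NP ((with)PREP "
             "((a)DET (mirror)N)NP)PP)NP)PP)M2)VP)S")
    SENT2 = ("(((the)DET (dog)N)NP ((saw)V (((the)DET (cat)N)NP "
             "((with)PREP ((a)DET (mirror)N)NP)PP )NP)VP)S")

    def parse_tree(s, i=0):
        """Parse one '( children )LABEL' tree starting at s[i];
        returns ((label, children), next_index)."""
        i += 1                                 # skip '('
        children = []
        while True:
            if s[i].isspace():
                i += 1
            elif s[i] == "(":
                child, i = parse_tree(s, i)
                children.append(child)
            elif s[i] == ")":
                m = re.match(r"[A-Za-z0-9]+", s[i + 1:])
                return (m.group(0), children), i + 1 + m.end()
            else:                              # a word at a leaf
                m = re.match(r"[^\s()]+", s[i:])
                children.append(m.group(0))
                i += m.end()

    def rules(tree, out):
        """Count the CFG rule read off at each node of the tree."""
        label, children = tree
        out[(label, tuple(c if isinstance(c, str) else c[0]
                          for c in children))] += 1
        for c in children:
            if not isinstance(c, str):
                rules(c, out)

    tree1, _ = parse_tree(SENT1)
    tree2, _ = parse_tree(SENT2)
    counts = Counter()
    rules(tree1, counts)
    rules(tree2, counts)
    for (lhs, rhs), n in sorted(counts.items()):
        print(f"{lhs} -> {' '.join(rhs)}   (count {n})")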
-
Now draw the lexicalized parse tree for the
second sentence, assuming that whenever there is a choice
of heads in any given subtree, chose (in the following precedence order):
VP, V, PREP, N, NP.
the dog saw the cat with a mirror
OK. Now you have everything you need to answer these "numerical"
questions:
-
Estimate p(NP -> NP PP) = _______________
-
Estimate p(NP -> PREP N) = _______________
-
Estimate p(VP -> V NP) = _______________
-
Estimate p(VP -> V M2) = _______________
Suppose we stick to the usual three independence assumptions
and define the probability of a parse tree s (of a sentence W) as
the simple product of p(<rule>) over all rules used in the course of generating
the sentence W.
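(Note: continuing the tree-reading sketch above and reusing its counts, tree1, and tree2, the MLE rule probabilities and the product-of-rules tree probability can be sketched as follows; the names are illustrative.)

    from collections import Counter
    from math import prod                      # Python 3.8+

    # MLE: q(lhs -> rhs) = c(lhs -> rhs) / c(lhs)
    lhs_totals = Counter()
    for (lhs, rhs), n in counts.items():
        lhs_totals[lhs] += n

    def q(lhs, rhs):
        return counts[(lhs, rhs)] / lhs_totals[lhs]

    def tree_prob(tree):
        """Product of the probabilities of all rules used in the tree."""
        label, children = tree
        rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
        return q(label, rhs) * prod(tree_prob(c) for c in children
                                    if not isinstance(c, str))

    print(tree_prob(tree2))                    # the parse of sentence 2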
Describe the main idea of the algorithm to compute:
-
(a) the best parse given a sentence and a model. Use five
sentences at most.
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
-
(b) the probability of a string given the model. Use five
sentences at most.
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
-
What, then, is the probability of the following parse tree,
using the estimated non-lexicalized PCFG:
-
p( (((the)DET (dog)N)NP ((saw)V ((the)DET
(cat)N)NP)VP)S ) = ______________
-
What is the probability of the training data parse of sentence
2?
-
p(s|sentence2) = _____________
-
What is the probability of the string "the dog saw the cat
with a mirror"?
-
p(sentence2) = _____________
-
Now a lexicalized-grammar question. What is the (raw) probability
estimate of the lexicalized rule p(PP(with) -> PREP(with) NP(mirror))?
[Imagine you generate the lexicalized rules, too, under the same independence
assumptions; to answer this question, there is of course no need to generate
them all. But do not forget to look at both training
sentences!]
-
p(PP(with) -> PREP(with) NP(mirror)) = __________________
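(Note: a sketch for the lexicalized estimate, continuing the tree-reading sketch above (tree1, tree2) and implementing the stated head precedence; labels not in the precedence list are assumed to rank last, which is an assumption.)

    from collections import Counter

    PRECEDENCE = ["VP", "V", "PREP", "N", "NP"]

    def head(tree):
        """Lexical head of a subtree under the given precedence order."""
        label, children = tree
        if isinstance(children[0], str):       # preterminal: the word itself
            return children[0]
        def rank(c):
            return (PRECEDENCE.index(c[0])
                    if c[0] in PRECEDENCE else len(PRECEDENCE))
        return head(min(children, key=rank))

    def lex_rules(tree, out):
        """Count lexicalized rules LHS(head) -> ... over the tree."""
        label, children = tree
        if isinstance(children[0], str):
            return
        out[(f"{label}({head(tree)})",
             tuple(f"{c[0]}({head(c)})" for c in children))] += 1
        for c in children:
            lex_rules(c, out)

    lex = Counter()
    lex_rules(tree1, lex)
    lex_rules(tree2, lex)
    target = ("PP(with)", ("PREP(with)", "NP(mirror)"))
    total = sum(n for (lhs, _), n in lex.items() if lhs == "PP(with)")
    print(lex[target], "/", total)             # numerator / denominator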
And that's it for today!
============================================================================
Now check that you have filled in your name and SSN. Also,
please carefully check your answers and hand the exam in.
[1] Use of calculators is allowed.