Natural Language - Mon, Dec 2

This is John Denero's research area. None of the content from this lecture will be on the final.

Ambiguity

In programming languages, each expression only has one interpretation. One sequence of command can only be interpreted one way -- programming languages are unambiguous.

Real languages, however, are ambiguous. The same sequence of words can be interpreted multiple ways.

Programs must be written for people to read

Preface of Structure and Interpretation of Computer Programs

This sentence has a noun phrase (Programs) and a verb phrase (must be written for people to read), the latter containing a subordinate clause (for people to read). Or, it has a noun phrase (Programs), a verb phrase (must be written), and a subordinate clause (for people to read). Or, it has a noun phrase (Programs), a verb phrase (must be written for people), and another verb phrase (to read).

To find the structure of this sentence, a computer must perform syntactic parsing. There's more to it than that; a computer needs to figure out what a word means in context of the sentence.

Syntax Trees

Representing Syntactic Structure

The input is "cows intimidate cows." The following Tree-Leaf structure is not the same as it has been throughout the semester.

A Tree represents a phrase:

A Leaf represents a single word:

Let's use these classes to represent our sentence:

cows = Leaf('N', 'cows')
intimidate = Leaf('V', 'intimidate')
S, NP, VP = 'S', 'NP', 'VP'
Tree(S, [Tree(NP, [cows]),
         Tree(VP, [intimidate,
                   Tree(NP, [cows])])])

Note that Denero's print_tree function prints a tree that looks suspiciously like a bunch of Scheme lists...

Context-Free Grammar Rules

A grammar rule describes how a tag can be expanded as a sequence of tags of words. For example, a sentence can be expanded as a noun phrase then a verb phrase.

Parsing

Exhaustive Parsing

Expand all tags recursively, but constrain words to match input. The input is "buffalo buffalo buffalo buffalo."

Learning

One technique for learning is scoring trees using relative frequencies. While a large number of syntactic structures may exist, not all are equally common. There will generally be one (or a few) structure(s) that is/are more likely to occur than others.