Jurafsky & Martin, Chapter 12: Constituency Grammars
Metadata
Title: Constituency Grammars
Number: 12
Core Ideas
The word syntax comes from the Greek syntaxis which means “setting out together or arrangement” and refers to the way words are arranged together.
The bulk of this chapter explores context-free grammars (CFGs), which are the backbone of many formal models of natural language syntax and play a role in many computational applications, including grammar checking, semantic interpretation, dialogue understanding, and machine translation.
CFGs are expressive enough to model much of natural language syntax while remaining computationally tractable: efficient parsing algorithms exist.
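To make the tractability point concrete, here is a minimal sketch (not from the book) of the CKY recognition algorithm running over a toy CFG already in Chomsky Normal Form; the grammar and vocabulary are invented for illustration. CKY fills a table of spans bottom-up in O(n³) time.

```python
# A CKY recognizer sketch for a toy CFG in Chomsky Normal Form.
# The grammar and words below are invented for illustration.
from itertools import product

# Rules in CNF: A -> B C (binary) or A -> word (lexical).
BINARY = {
    ("NP", "VP"): {"S"},
    ("Det", "N"): {"NP"},
    ("V", "NP"): {"VP"},
}
LEXICAL = {
    "the": {"Det"}, "a": {"Det"},
    "horse": {"N"}, "party": {"N"},
    "likes": {"V"},
}

def cky_recognize(words):
    """Return True if the toy grammar derives `words` from S."""
    n = len(words)
    # table[i][j] holds the nonterminals that span words[i:j]
    table = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):
        table[i][i + 1] = set(LEXICAL.get(w, ()))
    for span in range(2, n + 1):            # widen spans bottom-up
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):       # try every split point
                for b, c in product(table[i][k], table[k][j]):
                    table[i][j] |= BINARY.get((b, c), set())
    return "S" in table[0][n]
```

For example, `cky_recognize("the horse likes a party".split())` returns True, while a scrambled word order is rejected.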
Constituency
Syntactic constituency is the idea that groups of words can behave as single units, or constituents. Part of developing a grammar involves building an inventory of the constituents in the language.
Consider a noun phrase, a sequence of words surrounding at least one noun, such as: “Harry the Horse”, “three parties from Brooklyn”, “the reason he comes into the Hot Box”, “a high class spot such as Mindy’s”, “they”.
Why do we say that these words group together? One reason is that they all appear in similar syntactic environments, e.g. before a verb.
Other evidence comes from preposed or postposed constructions. E.g. the prepositional phrase on September seventeenth can be placed in a number of positions, including before and after the phrase I’d like to fly from Atlanta to Denver, but its individual words can’t be moved around like that.
Context Free Grammars
The chapter then explores Context Free Grammars
English Grammar Rules
The chapter then explores English Grammar Rules
Treebanks
With a sufficiently robust grammar, we can assign a parse tree to any sentence. A treebank is a corpus that pairs every sentence with a corresponding parse tree.
The general mode of construction is to use a parser to automatically parse the corpus, and then have a human linguist hand-correct the parses.
One of the best known is the Penn Treebank. See the list of Penn Treebank Constituent Tags.
One issue we have to deal with is syntactic movement. For example, take quotations. We might think of a standard form like “He said we have to go to the movies tomorrow”, but we could also see it ordered as “we have to go to the movies tomorrow, he said”.
Some treebanks annotate this second case in a special way, with a node like -NONE- marking the empty position after “said” where the quotation would usually fall.
That kind of annotation can help some parsers recover the semantic relations. Again, see the list of tags.
We can extract grammars from treebanks. They typically yield very large numbers of flat rules, which presents a challenge for probabilistic parsing. This is considered further in Appendix C of the book.
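Grammar extraction can be sketched in a few lines: read a Penn-style bracketed parse into a tree, then record one production per internal node. This is an illustrative sketch, not the treebank toolchain itself; the tiny example tree and tag names are simplified.

```python
# Sketch: read a Penn-style bracketed parse and extract its CFG rules.
# Trees are (label, children) tuples; leaf children are plain strings.
import re

def parse_bracketed(s):
    """Parse '(S (NP they) (VP (V fly)))' into nested (label, children) tuples."""
    tokens = re.findall(r"\(|\)|[^\s()]+", s)
    pos = 0
    def read():
        nonlocal pos
        pos += 1                      # consume "("
        label = tokens[pos]; pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())
            else:
                children.append(tokens[pos]); pos += 1
        pos += 1                      # consume ")"
        return (label, children)
    return read()

def extract_rules(tree, rules=None):
    """Collect label -> children-labels productions from one parse tree."""
    if rules is None:
        rules = []
    label, children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    rules.append((label, rhs))
    for c in children:
        if not isinstance(c, str):
            extract_rules(c, rules)
    return rules
```

Running `extract_rules(parse_bracketed("(S (NP they) (VP (V fly)))"))` yields productions such as S → NP VP; across a full treebank, accumulating these (with counts) is the starting point for a probabilistic grammar.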
Heads
The book then discusses how we find a lexical head of a syntactic constituent. This is linguistically controversial in many cases.
In practice, we often apply a set of handwritten rules to a parsed sentence to identify the head of each constituent. The book gives examples of such rules, namely those used with the Penn Treebank.
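Such handwritten head rules can be sketched as a per-category priority list that is applied recursively (head percolation). The rule table below is invented and far smaller than the real Penn Treebank tables, but shows the mechanism.

```python
# Sketch of head-finding via handwritten rules and head percolation.
# Trees are (label, children) tuples; the HEAD_RULES table is a toy
# stand-in for the much larger tables used with the Penn Treebank.

# For each phrase label: child categories to search for, in priority order.
HEAD_RULES = {
    "S":  ["VP", "NP"],
    "NP": ["NN", "NNS", "NP"],
    "VP": ["VB", "VBD", "VP"],
}

def find_head(tree):
    """Return the head word of a constituent by recursive head percolation."""
    label, children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return children[0]            # preterminal: the word itself
    for target in HEAD_RULES.get(label, []):
        for child in children:
            if not isinstance(child, str) and child[0] == target:
                return find_head(child)
    return find_head(children[0])     # fallback: leftmost non-leaf child
```

On a parse of “Harry flew”, the head of the S percolates up from the verb: the S rule picks the VP, the VP rule picks the VBD, and the head word is “flew”.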
Grammar Equivalence and Normal Form
The book then reviews grammatical equivalence and Chomsky Normal Form. This was covered in other modules so not repeated here.
Lexicalized Grammars
The final part of the chapter reviews the problems with using phrase-structure rules of the kind we’ve seen so far as the basis of grammars.
Such grammars can lead to clunky solutions for issues like agreement, long-distance dependencies, and subcategorization of verbs.
Alternative approaches have attempted to make more use of the lexicon rather than just the rules. These are called lexicalized grammars.
One such approach is dependency grammar, covered in the next chapter. Another is Combinatory Categorial Grammar (CCG).