Context Free Grammars (Jurafsky Martin)
The most widely used formal system for modeling constituent structure in natural languages is the Context Free Grammar or CFG.
CFGs are also called Phrase Structure Grammars and the formalism is equivalent to Backaus-Naur Form or BNF. It was formalized independently by Chomsky and Backus.
A CFG consists of a set of rules or productions, each expressing the ways symbols of the language can be grouped and ordered together, and a lexicon of words and symbols.
The symbols that are used in a CFG are in two classes. Terminal symbols correspond to words in the language. The lexicon is the set of rules that introduce these terminal symbols. Non-terminal symbols express abstractions over these terminals.
Each context free rule has an arrow, to the left of which is a single non-terminal symbol expressing some cluster or generalization. To the right is an ordered list of one or more terminals and non-terminals. The non-terminal associated with each word in the lexicon is its part of speech, or lexical category.
A CFG can be thought of either as a device for generating sentences, or a device for assigning a structure to a given sentence.
For the former, read the arrow as “rewrite the symbol on the left with the string of symbols on the right”. The sequence of expansions that results is called a derivation of the string of words. We often represent this as a parse tree, with the root at the top.
The formal language defined by a CFG is the set of strings derivable from the designated start symbol, each grammar must have one, often called S.
\[S \rightarrow NP \quad VP\]
A CFG defines a formal language, ie a set of strings. Sentences that can be derived by a grammar are in the language and are called grammatical. Sentences that are not in the language are ungrammatical. This is called generative grammar as the language is defined by the set of sentences generated by the grammar.
Formal Definition
Formally a CFG is defined by four parameters:
\(N\) A set of non-terminal symbols or variables
\(\Sigma\) A set of terminal symbols (disjoint from \(N\))
\(R\) A set of rules each of the form \(A \rightarrow \beta\) where
\(A\) is a non terminal
\(\beta\) is a string of symbols from the infinite set \((\Sigma \cup N)^*\)
\(S\) a designated start symbol, from \(N\)
A language is defined through derivation. A string derives another string if it can be rewritten as the second one by some series of rule applications.
We start with the concept of direct derivation, from Hopcroft and Ullman:
if \(A \rightarrow \beta\) is a production of \(R\) and \(\alpha\) and \(\gamma\) are any strings in the set \((\Sigma \cup N)^*\), then we say that \(\alpha A \gamma\) directly derives \(\alpha \beta \gamma\), or \(\alpha \beta \gamma \; \Rightarrow \; \alpha \beta \gamma\)
We can then generalize from this notion of direct derivation.
Let \(\alpha_1, \alpha_2, \dots, \alpha_m\) be strings in \((\Sigma \cup N)^*\), with \(m \geq 1\) such that:
\[\alpha_1 \Rightarrow \alpha_2, \alpha_2 \Rightarrow \alpha_3, \dots, \alpha_{m-1} \Rightarrow \alpha_m\]
We can then say that \(\alpha_1\) derives \(\alpha_m\), or \(\alpha_1 \xRightarrow{*} \alpha_m\).
From here we can define a language \(\mathcal{L}_G\) generated by a grammar \(G\) with a designated start symbol \(S\):
\[\mathcal{L}_G = \{w \; | \; w \; is \; in \; \Sigma^* \; \text{and} \; S \xRightarrow{*} \;w \} \]