4.1: Context-free Grammars

In its most general form, a grammar is a set of rewriting rules. A rewriting rule specifies that a certain string of symbols can be substituted for all or part of another string. If \(w\) and \(u\) are strings, then \(w \longrightarrow u\) is a rewriting rule that specifies that the string \(w\) can be replaced by the string \(u\). The symbol "\(\longrightarrow\)" is read "can be rewritten as." Rewriting rules are also called production rules or productions, and "\(\longrightarrow\)" can also be read as "produces." For example, if we consider strings over the alphabet \(\{a, b, c\}\), then the production rule \(aba \longrightarrow cc\) can be applied to the string \(abbabac\) to give the string \(abbccc\). The substring \(aba\) in the string \(abbabac\) has been replaced with \(cc\).

In a context-free grammar, every rewriting rule has the form \(A \longrightarrow w\), where \(A\) is a single symbol and \(w\) is a string of zero or more symbols. (The grammar is "context-free" in the sense that \(w\) can be substituted for \(A\) wherever \(A\) occurs in a string, regardless of the surrounding context in which \(A\) occurs.) The symbols that occur on the left-hand sides of production rules in a context-free grammar are called non-terminal symbols. By convention, the non-terminal symbols are usually uppercase letters. The strings on the right-hand sides of the production rules can include non-terminal symbols as well as other symbols, which are called terminal symbols. By convention, the terminal symbols are usually lowercase letters. Here are some typical production rules that might occur in context-free grammars:

\(A \longrightarrow aAbB\)
\(S \longrightarrow S S\)
\(C \longrightarrow A c c\)
\(B \longrightarrow b\)
\(A \longrightarrow \varepsilon\)

In the last rule in this list, \(\varepsilon\) represents the empty string, as usual. For example, this rule could be applied to the string \(aBaAcA\) to produce the string \(aBacA\). The first occurrence of the symbol \(A\) in \(aBaAcA\) has been replaced by the empty string, which is just another way of saying that the symbol has been dropped from the string.

In every context-free grammar, one of the non-terminal symbols is designated as the start symbol of the grammar. The start symbol is often, though not always, denoted by \(S\). When the grammar is used to generate strings in a language, the idea is to start with a string consisting of nothing but the start symbol. Then a sequence of production rules is applied. Each application of a production rule to the string transforms the string to a new string. If and when this process produces a string that consists purely of terminal symbols, the process ends. That string of terminal symbols is one of the strings in the language generated by the grammar. In fact, the language consists precisely of all strings of terminal symbols that can be produced in this way.

As a simple example, consider a grammar that has three production rules: \(S \longrightarrow aS\), \(S \longrightarrow bS\), and \(S \longrightarrow b\). In this example, \(S\) is the only non-terminal symbol, and the terminal symbols are \(a\) and \(b\). Starting from the string \(S\), we can apply any of the three rules of the grammar to produce either \(aS\), \(bS\), or \(b\). Since the string \(b\) contains no non-terminals, we see that \(b\) is one of the strings in the language generated by this grammar. The strings \(aS\) and \(bS\) are not in that language, since they contain the non-terminal symbol \(S\), but we can continue to apply production rules to these strings. From \(aS\), for example, we can obtain \(aaS\), \(abS\), or \(ab\). From \(abS\), we go on to obtain \(abaS\), \(abbS\), or \(abb\). The strings \(ab\) and \(abb\) are in the language generated by the grammar. It's not hard to see that any string of \(a\)'s and \(b\)'s that ends with a \(b\) can be generated by this grammar, and that these are the only strings that can be generated. That is, the language generated by this grammar is the regular language specified by the regular expression \((a|b)^*b\).
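Before moving on, here is how this generation process might look as a short program. This is a minimal Python sketch of my own (the function `generate` and the pair encoding of rules are illustrative assumptions, not anything from the text) that enumerates every string of length at most three generated by the grammar \(S \longrightarrow aS\), \(S \longrightarrow bS\), \(S \longrightarrow b\):

```python
from collections import deque

# The example grammar: S -> aS, S -> bS, S -> b.  Since S is the only
# non-terminal and occurs at most once in any string, replacing its
# first occurrence is enough to enumerate every derivation.
RULES = [("S", "aS"), ("S", "bS"), ("S", "b")]

def generate(max_len):
    """Collect every all-terminal string of length <= max_len derivable
    from the start string "S".  Pruning longer strings is safe for this
    grammar because no rule ever shrinks the string."""
    language = set()
    queue = deque(["S"])
    seen = {"S"}
    while queue:
        s = queue.popleft()
        if len(s) > max_len:
            continue
        if "S" not in s:            # no non-terminals left: s is in the language
            language.add(s)
            continue
        for lhs, rhs in RULES:      # apply each rule to the occurrence of S
            i = s.index(lhs)
            t = s[:i] + rhs + s[i + 1:]
            if t not in seen:
                seen.add(t)
                queue.append(t)
    return sorted(language)

print(generate(3))   # ['aab', 'ab', 'abb', 'b', 'bab', 'bb', 'bbb']
```

The output is exactly the strings of length at most three that end in \(b\), as the discussion above predicts. It's time to give some formal definitions of the concepts which we have been discussing.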

Definition 4.1. A context-free grammar is a 4-tuple \((V, \Sigma, P, S)\), where:

1. \(V\) is a finite set of symbols. The elements of \(V\) are the non-terminal symbols of the grammar.
2. \(\Sigma\) is a finite set of symbols such that \(V \cap \Sigma = \emptyset\). The elements of \(\Sigma\) are the terminal symbols of the grammar.
3. \(P\) is a set of production rules. Each rule is of the form \(A \longrightarrow w\), where \(A\) is one of the symbols in \(V\) and \(w\) is a string in the language \((V \cup \Sigma)^*\).
4. \(S \in V\). \(S\) is the start symbol of the grammar.
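As an aside, the 4-tuple can be transcribed almost directly into code. Here is a small Python sketch of my own (the class name `Grammar` and the pair encoding of rules are assumptions for illustration, not part of the text); the assertions mirror clauses 2 through 4 of the definition:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Grammar:
    """A context-free grammar as the 4-tuple (V, Sigma, P, S)."""
    V: frozenset      # non-terminal symbols
    Sigma: frozenset  # terminal symbols
    P: frozenset      # production rules, encoded as pairs (A, w)
    S: str            # start symbol

    def __post_init__(self):
        assert self.V.isdisjoint(self.Sigma)        # clause 2: V and Sigma are disjoint
        assert all(A in self.V                      # clause 3: A is in V and
                   and all(c in self.V | self.Sigma for c in w)  # w is over V union Sigma
                   for A, w in self.P)
        assert self.S in self.V                     # clause 4: S is in V

# The grammar from the example above: S -> aS, S -> bS, S -> b.
G = Grammar(V=frozenset({"S"}),
            Sigma=frozenset({"a", "b"}),
            P=frozenset({("S", "aS"), ("S", "bS"), ("S", "b")}),
            S="S")
```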

Even though this is the formal definition, grammars are often specified informally simply by listing the set of production rules. When this is done it is assumed, unless otherwise specified, that the non-terminal symbols are just the symbols that occur on the left-hand sides of production rules of the grammar. The terminal symbols are all the other symbols that occur on the right-hand sides of production rules. The start symbol is the symbol that occurs on the left-hand side of the first production rule in the list. Thus, the list of production rules

\(T \longrightarrow TT\)
\(T \longrightarrow A\)
\(A \longrightarrow a A a\)
\(A \longrightarrow b B\)
\(B \longrightarrow b B\)
\(B \longrightarrow \varepsilon\)

specifies a grammar \(G = (V, \Sigma, P, T)\) where \(V\) is \(\{T, A, B\}\), \(\Sigma\) is \(\{a, b\}\), and \(T\) is the start symbol. \(P\), of course, is the set containing the six production rules in the list.

Let \(G = (V, \Sigma, P, S)\) be a context-free grammar. Suppose that \(x\) and \(y\) are strings in the language \((V \cup \Sigma)^*\). The notation \(x \Longrightarrow_G y\) is used to express the fact that \(y\) can be obtained from \(x\) by applying one of the production rules in \(P\). To be more exact, we say that \(x \Longrightarrow_G y\) if and only if there is a production rule \(A \longrightarrow w\) in the grammar and two strings \(u\) and \(v\) in the language \((V \cup \Sigma)^*\) such that \(x = uAv\) and \(y = uwv\). The fact that \(x = uAv\) is just a way of saying that \(A\) occurs somewhere in \(x\). When the production rule \(A \longrightarrow w\) is applied to substitute \(w\) for \(A\) in \(uAv\), the result is \(uwv\), which is \(y\). Note that either \(u\) or \(v\) or both can be the empty string.

If a string \(y\) can be obtained from a string \(x\) by applying a sequence of zero or more production rules, we write \(x \Longrightarrow_G^* y\). In most cases, the "\(G\)" in the notations \(\Longrightarrow_G\) and \(\Longrightarrow_G^*\) will be omitted, assuming that the grammar in question is understood. Note that \(\Longrightarrow\) is a relation on the set \((V \cup \Sigma)^*\). The relation \(\Longrightarrow^*\) is the reflexive, transitive closure of that relation. (This explains the use of "\(*\)", which is usually used to denote the transitive, but not necessarily reflexive, closure of a relation. In this case, \(\Longrightarrow^*\) is reflexive as well as transitive since \(x \Longrightarrow^* x\) is true for any string \(x\).) For example, using the grammar that is defined by the above list of production rules, we have

\(\begin{aligned} aTB &\Longrightarrow aTTB \\ &\Longrightarrow aTAB \\ &\Longrightarrow aTAbB \\ &\Longrightarrow aTbBbB \\ &\Longrightarrow aTbbB \end{aligned}\)

From this, it follows that \(aTB \Longrightarrow^* aTbbB\). The relation \(\Longrightarrow\) is read "yields" or "produces," while \(\Longrightarrow^*\) can be read "yields in zero or more steps" or "produces in zero or more steps."
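Here is the one-step relation \(\Longrightarrow\) as a short Python sketch of my own (the function name `step` and the string encoding are illustrative assumptions): it finds every string \(y\) with \(x \Longrightarrow y\) by locating each occurrence of a left-hand side \(A\) in \(x\) and substituting the corresponding \(w\):

```python
def step(x, rules):
    """All strings y with x ==> y: every way of writing x = uAv and
    replacing A by w, for some production A -> w in the rule list."""
    ys = set()
    for A, w in rules:
        start = 0
        while (i := x.find(A, start)) != -1:   # each occurrence of A in x
            ys.add(x[:i] + w + x[i + 1:])      # x = uAv becomes uwv
            start = i + 1
    return ys

# The six-rule grammar from the list above; epsilon is the empty string.
RULES = [("T", "TT"), ("T", "A"), ("A", "aAa"),
         ("A", "bB"), ("B", "bB"), ("B", "")]

print("aTTB" in step("aTB", RULES))      # True: apply T -> TT
print("aTbbB" in step("aTbBbB", RULES))  # True: apply B -> epsilon
```

The following theorem states some simple facts about the relations \(\Longrightarrow\) and \(\Longrightarrow^*\):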

Theorem 4.1. Let \(G\) be the context-free grammar \((V, \Sigma, P, S)\). Then:

1. If \(x\) and \(y\) are strings in \((V \cup \Sigma)^*\) such that \(x \Longrightarrow y\), then \(x \Longrightarrow^* y\).
2. If \(x\), \(y\), and \(z\) are strings in \((V \cup \Sigma)^*\) such that \(x \Longrightarrow^* y\) and \(y \Longrightarrow^* z\), then \(x \Longrightarrow^* z\).
3. If \(x\) and \(y\) are strings in \((V \cup \Sigma)^*\) such that \(x \Longrightarrow y\), and if \(s\) and \(t\) are any strings in \((V \cup \Sigma)^*\), then \(sxt \Longrightarrow syt\).
4. If \(x\) and \(y\) are strings in \((V \cup \Sigma)^*\) such that \(x \Longrightarrow^* y\), and if \(s\) and \(t\) are any strings in \((V \cup \Sigma)^*\), then \(sxt \Longrightarrow^* syt\).

Proof. Parts 1 and 2 follow from the fact that \(\Longrightarrow^*\) is the transitive closure of \(\Longrightarrow\). Part 4 follows easily from Part 3. (I leave this as an exercise.) To prove Part 3, suppose that \(x\), \(y\), \(s\), and \(t\) are strings such that \(x \Longrightarrow y\). By definition, this means that there exist strings \(u\) and \(v\) and a production rule \(A \longrightarrow w\) such that \(x = uAv\) and \(y = uwv\). But then we also have \(sxt = suAvt\) and \(syt = suwvt\). These two equations, along with the existence of the production rule \(A \longrightarrow w\), show, by definition, that \(sxt \Longrightarrow syt\).

We can use \(\Longrightarrow^*\) to give a formal definition of the language generated by a context-free grammar:

Definition 4.2. Suppose that \(G = (V, \Sigma, P, S)\) is a context-free grammar. Then the language generated by \(G\) is the language \(L(G)\) over the alphabet \(\Sigma\) defined by \(L(G) = \{ w \in \Sigma^* \mid S \Longrightarrow_G^* w \}\). That is, \(L(G)\) contains any string of terminal symbols that can be obtained by starting with the string consisting of the start symbol, \(S\), and applying a sequence of production rules.

A language \(L\) is said to be a context-free language if there is a context-free grammar \(G\) such that \(L(G)\) is \(L\). Note that there might be many different context-free grammars that generate the same context-free language. Two context-free grammars that generate the same language are said to be equivalent.

Suppose \(G\) is a context-free grammar with start symbol \(S\) and suppose \(w \in L(G)\). By definition, this means that there is a sequence of one or more applications of production rules which produces the string \(w\) from \(S\). This sequence has the form \(S \Longrightarrow x_1 \Longrightarrow x_2 \Longrightarrow \cdots \Longrightarrow w\). Such a sequence is called a derivation of \(w\) (in the grammar \(G\)). Note that \(w\) might have more than one derivation. That is, it might be possible to produce \(w\) in several different ways.

Consider the language \(L = \{ a^n b^n \mid n \in \mathbb{N} \}\). We already know that \(L\) is not a regular language. However, it is a context-free language. That is, there is a context-free grammar \(G\) such that \(L\) is the language generated by \(G\). This gives us our first theorem about grammars:

Theorem 4.2. Let \(L\) be the language \(L = \{ a^n b^n \mid n \in \mathbb{N} \}\). Let \(G\) be the context-free grammar \((V, \Sigma, P, S)\) where \(V = \{S\}\), \(\Sigma = \{a, b\}\), and \(P\) consists of the productions

\(S \longrightarrow aSb\)
\(S \longrightarrow \varepsilon\)
Then \(L = L(G)\), so that \(L\) is a context-free language. In particular, there exist context-free languages which are not regular.
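Although the theorem is stated without proof here, the derivations behind it are easy to trace mechanically. The following Python fragment (an illustration of my own, not a proof) applies \(S \longrightarrow aSb\) repeatedly and finishes with \(S \longrightarrow \varepsilon\), producing exactly the strings \(a^n b^n\):

```python
def anbn_strings(max_n):
    """Trace S => aSb => aaSbb => ... and finish with S -> epsilon,
    yielding a^n b^n for n = 0, 1, ..., max_n."""
    out = []
    s = "S"
    for n in range(max_n + 1):
        out.append(s.replace("S", ""))   # finish the derivation with S -> epsilon
        s = s.replace("S", "aSb")        # one application of S -> aSb
    return out

print(anbn_strings(3))   # ['', 'ab', 'aabb', 'aaabbb']
assert all(w == "a" * n + "b" * n for n, w in enumerate(anbn_strings(3)))
```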

Theorem 4.3. Suppose that \(L\) and \(M\) are context-free languages. Then the languages \(L \cup M\), \(LM\), and \(L^*\) are also context-free.

Proof. I will prove only the first claim of the theorem, that \(L \cup M\) is context-free. In the exercises for this section, you are asked to construct grammars for \(LM\) and \(L^*\) (without giving formal proofs that your answers are correct).

Let \(G = (V, \Sigma, P, S)\) and \(H = (W, \Gamma, Q, T)\) be context-free grammars such that \(L = L(G)\) and \(M = L(H)\). We can assume that \(W \cap V = \emptyset\), since otherwise we could simply rename the non-terminal symbols in \(W\). The idea of the proof is that to generate a string in \(L \cup M\), we first decide whether we want a string in \(L\) or a string in \(M\). Once that decision is made, to make a string in \(L\), we use production rules from \(G\), while to make a string in \(M\), we use rules from \(H\). We have to design a grammar, \(K\), to represent this process.

Let \(R\) be a symbol that is not in any of the alphabets \(V\), \(W\), \(\Sigma\), or \(\Gamma\). \(R\) will be the start symbol of \(K\). The production rules for \(K\) consist of all the production rules from \(G\) and \(H\) together with two new rules:

\(R \longrightarrow S\)

\(R \longrightarrow T\)

Formally, \(K\) is defined to be the grammar \((V \cup W \cup \{R\},\ \Sigma \cup \Gamma,\ P \cup Q \cup \{R \longrightarrow S,\, R \longrightarrow T\},\ R)\).

Suppose that \(w \in L\). That is, \(w \in L(G)\), so there is a derivation \(S \Longrightarrow_G^* w\). Since every rule from \(G\) is also a rule in \(K\), it follows that \(S \Longrightarrow_K^* w\). Combining this with the fact that \(R \Longrightarrow_K S\), we have that \(R \Longrightarrow_K^* w\), and \(w \in L(K)\). This shows that \(L \subseteq L(K)\). In exactly the same way, we can show that \(M \subseteq L(K)\). Thus, \(L \cup M \subseteq L(K)\).

It remains to show that \(L(K) \subseteq L \cup M\). Suppose \(w \in L(K)\). Then there is a derivation \(R \Longrightarrow_K^* w\). This derivation must begin with an application of one of the rules \(R \longrightarrow S\) or \(R \longrightarrow T\), since these are the only rules in which \(R\) appears. If the first rule applied in the derivation is \(R \longrightarrow S\), then the remainder of the derivation shows that \(S \Longrightarrow_K^* w\). Starting from \(S\), the only rules that can be applied are rules from \(G\), so in fact we have \(S \Longrightarrow_G^* w\). This shows that \(w \in L\). Similarly, if the first rule applied in the derivation \(R \Longrightarrow_K^* w\) is \(R \longrightarrow T\), then \(w \in M\). In any case, \(w \in L \cup M\). This proves that \(L(K) \subseteq L \cup M\).
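The construction of \(K\) in this proof is completely mechanical, and a short Python sketch may make that concrete. This is my own illustration (the function name `union_grammar` and the tuple encoding of grammars are assumptions, not from the text):

```python
def union_grammar(G, H, R="R"):
    """Build the grammar K from the proof of Theorem 4.3.  G and H are
    4-tuples (non-terminals, terminals, rules, start) with rules encoded
    as (lhs, rhs) pairs.  Assumes the non-terminal sets are disjoint and
    that R is a fresh symbol."""
    V, Sigma, P, S = G
    W, Gamma, Q, T = H
    assert V.isdisjoint(W) and R not in V | W | Sigma | Gamma
    return (V | W | {R},                  # non-terminals of K
            Sigma | Gamma,                # terminals of K
            P | Q | {(R, S), (R, T)},     # all old rules plus R -> S, R -> T
            R)                            # new start symbol

# Example: G generates {a^n b^n}; H generates b*.
G = ({"S"}, {"a", "b"}, {("S", "aSb"), ("S", "")}, "S")
H = ({"T"}, {"b"}, {("T", "bT"), ("T", "")}, "T")
K = union_grammar(G, H)
print(sorted(K[2]))   # the rules of K include ('R', 'S') and ('R', 'T')
```

Finally, we should clarify the relationship between context-free languages and regular languages. We have already seen that there are context-free languages which are not regular. On the other hand, it turns out that every regular language is context-free. That is, given any regular language, there is a context-free grammar that generates that language. This means that any syntax that can be expressed by a regular expression, by a DFA, or by an NFA could also be expressed by a context-free grammar. In fact, we only need a certain restricted type of context-free grammar to duplicate the power of regular expressions.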

Definition 4.3. A right-regular grammar is a context-free grammar in which the right-hand side of every production rule has one of the following forms: the empty string; a string consisting of a single non-terminal symbol; or a string consisting of a single terminal symbol followed by a single non-terminal symbol.

Examples of the types of production rule that are allowed in a right-regular grammar are \(A \longrightarrow \varepsilon\), \(B \longrightarrow C\), and \(D \longrightarrow aE\). The idea of the proof is that given a right-regular grammar, we can build a corresponding NFA, and vice versa. The states of the NFA correspond to the non-terminal symbols of the grammar. The start symbol of the grammar corresponds to the starting state of the NFA. A production rule of the form \(A \longrightarrow bC\) corresponds to a transition in the NFA from state \(A\) to state \(C\) while reading the symbol \(b\). A production rule of the form \(A \longrightarrow B\) corresponds to an \(\varepsilon\)-transition from state \(A\) to state \(B\) in the NFA. And a production rule of the form \(A \longrightarrow \varepsilon\) exists in the grammar if and only if \(A\) is a final state in the NFA. With this correspondence, a derivation of a string \(w\) in the grammar corresponds to an execution path through the NFA as it accepts the string \(w\).
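To make the correspondence concrete, here is a Python sketch of my own (the function names and the dictionary encoding of the transition function are illustrative assumptions, not from the text) that builds an NFA from a right-regular grammar and simulates it. The example grammar generates the language of \((a|b)^*b\); note that the rule \(S \longrightarrow b\) from the earlier example is not of an allowed form, so it is replaced by \(S \longrightarrow bF\) and \(F \longrightarrow \varepsilon\):

```python
def grammar_to_nfa(rules, start):
    """Translate a right-regular grammar into an NFA, following the text:
    A -> bC gives a transition A --b--> C, A -> B gives an epsilon-transition,
    and A -> "" (epsilon) makes A an accepting state."""
    delta = {}          # (state, symbol) -> set of states; symbol "" is epsilon
    accepting = set()
    for A, w in rules:
        if w == "":                              # A -> epsilon: A is accepting
            accepting.add(A)
        elif len(w) == 1:                        # A -> B: epsilon-transition
            delta.setdefault((A, ""), set()).add(w)
        else:                                    # A -> bC: read b, go to C
            delta.setdefault((A, w[0]), set()).add(w[1])
    return delta, accepting, start

def eps_closure(states, delta):
    """All states reachable from `states` by epsilon-transitions alone."""
    stack, closed = list(states), set(states)
    while stack:
        q = stack.pop()
        for r in delta.get((q, ""), ()):
            if r not in closed:
                closed.add(r)
                stack.append(r)
    return closed

def accepts(nfa, s):
    """Standard NFA simulation: track the set of currently possible states."""
    delta, accepting, start = nfa
    current = eps_closure({start}, delta)
    for c in s:
        nxt = set()
        for q in current:
            nxt |= delta.get((q, c), set())
        current = eps_closure(nxt, delta)
    return bool(current & accepting)

# Right-regular grammar for (a|b)*b:  S -> aS, S -> bS, S -> bF, F -> epsilon.
nfa = grammar_to_nfa([("S", "aS"), ("S", "bS"), ("S", "bF"), ("F", "")], "S")
print(accepts(nfa, "abb"), accepts(nfa, "ba"))   # True False
```

I won't give a complete proof here. You are welcome to work through the details if you want. But the important fact is: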

Theorem 4.4. A language L is regular if and only if there is a right-regular grammar G such that L = L(G). In particular, every regular language is context-free.

Exercises

  1. Find a context-free grammar that generates the language \(\{ w \in \{a, b\}^* \mid n_a(w) = n_b(w) \}\), where \(n_a(w)\) is the number of \(a\)'s in \(w\) and \(n_b(w)\) is the number of \(b\)'s.
  2. A palindrome is a string that reads the same backwards and forwards, such as "mom", "radar", or "aabccbccbaa". That is, \(w\) is a palindrome if \(w = w^R\), where \(w^R\) is the reverse of \(w\). Let \(L = \{ w \in \{a, b, c\}^* \mid w \text{ is a palindrome} \}\). Show that \(L\) is a context-free language by finding a context-free grammar that generates \(L\).
  3. Let \(\Sigma = \{\,(, ), [, ]\,\}\). That is, \(\Sigma\) is the alphabet consisting of the four symbols (, ), [, and ]. Let \(L\) be the language over \(\Sigma\) consisting of strings in which both parentheses and brackets are balanced. For example, the string ([][()()])([]) is in \(L\) but [(]) is not. Find a context-free grammar that generates the language \(L\).
  4. Suppose that \(G\) and \(H\) are context-free grammars. Let \(L = L(G)\) and let \(M = L(H)\). Explain how to construct a context-free grammar for the language \(LM\). You do not need to give a formal proof that your grammar is correct.
  5. Suppose that \(G\) is a context-free grammar. Let \(L = L(G)\). Explain how to construct a context-free grammar for the language \(L^*\). You do not need to give a formal proof that your grammar is correct.
  6. Suppose that \(L\) is a context-free language. Prove that \(L^R\) is a context-free language, where \(L^R = \{ w^R \mid w \in L \}\) is the set of reverses of the strings in \(L\). (Hint: Given a context-free grammar \(G\) for \(L\), make a new grammar, \(G^R\), by reversing the right-hand side of each of the production rules in \(G\). That is, \(A \longrightarrow w\) is a production rule in \(G\) if and only if \(A \longrightarrow w^R\) is a production rule in \(G^R\).)
  7. Define a left-regular grammar to be a context-free grammar in which the right-hand side of every production rule is of one of the following forms: the empty string; a single non-terminal symbol; or a non-terminal symbol followed by a terminal symbol. Show that a language is regular if and only if it can be generated by a left-regular grammar. (Hint: Use the preceding exercise and Theorem 4.4.)
