
Cooper K., Engineering a Compiler
2.10 Lexical Follies of Real Programming Languages
Digression: The Hazards of Bad Lexical Design
An apocryphal story has long circulated in the compiler construction community. It suggests that an early NASA mission to Mars (or Venus, or the Moon) crashed because of a missing comma in the Fortran code for a do loop. Of course, the body of the loop would have executed once, rather than the intended number of times. While we doubt the truth of the story, it has achieved the status of an “urban legend” among compiler writers.
it occurs as a word on its own. Similarly, B expands to 2. Again, the scanner must rely on the six-character limit.
With the parameters expanded, line three scans as IMPLICIT CHARACTER*4 (A-B). It tells the compiler that any variable beginning with the letter A or B has the data type of a four-character string. Of course, the six-character limit makes it possible for the scanner to recognize IMPLICIT.
Line four has no new lexical complexity. It declares INTEGER arrays of ten elements named FORMAT and IF, and a scalar INTEGER variable named DO9E1.
Line five begins with a statement label, 100. Since Fortran was designed for punch cards, it has a fixed-field format for each line. Columns 1 through 5 are reserved for statement labels; a C in column 1 indicates that the entire line is a comment. Column 6 is empty, unless the line is a continuation of the previous line. The remainder of line five is a FORMAT statement. The notation 4H)=(3 is a “Hollerith constant.” 4H indicates that the following four characters form a literal constant. Thus, the entire line scans as:
⟨label, 100⟩ ⟨format keyword⟩ ⟨(⟩ ⟨constant, ")=(3"⟩ ⟨)⟩
This is a FORMAT statement, used to specify the way that characters are read or written in a READ, WRITE, or PRINT statement.
Line six is an assignment of the value 3 to the fourth element of the INTEGER array FORMAT, declared back on line 4. To distinguish between the variable and the keyword, the scanner must read past the (4) to reach the equals sign. Since the equals sign indicates an assignment statement, the text to its left must be a reference to a variable. Thus, FORMAT is the variable rather than the keyword. Of course, adding the H from line 5 would change that interpretation.
Line 7 assigns the value 1 to the variable DO9E1, while line 8 marks the beginning of a DO loop that ends at label 9 and uses the induction variable E1. The difference between these lines lies in the comma following =1. The scanner cannot decide whether DO9E1 is a variable or the sequence ⟨keyword, DO⟩ ⟨label, 9⟩ ⟨variable, E1⟩ until it reaches either the comma or the end of the line.
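The comma-driven decision can be sketched concretely. The following Python fragment (our illustration, not from the book) classifies a blank-stripped statement that begins with DO, under the simplifying assumption that a loop header has the shape DO label var = start, bound with no commas hidden inside parentheses:

```python
import re

def classify_do(stmt):
    """Classify a blank-stripped Fortran statement beginning with DO.

    Sketch only: assumes the simplified shape DOlabelvar=start,bound
    and ignores commas inside parenthesized subscripts.
    """
    m = re.match(r'DO(\d+)([A-Z][A-Z0-9]*)=(.*)$', stmt)
    if m and ',' in m.group(3):
        # a comma after the '=' marks a loop header: DO label var = ...
        return 'do-loop'
    # no comma: the whole prefix is one identifier being assigned
    return 'assignment'

print(classify_do('DO9E1=1'))    # line 7: an assignment to DO9E1
print(classify_do('DO9E1=1,8'))  # line 8: a DO loop, label 9, variable E1
```

Note that the decision hinges on text well to the right of the word being classified, just as described above.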
The next three lines look quite similar, but scan differently. The first is an assignment of the value 1 to the Xth element of the array IF declared on line 4. The second is an IF statement that assigns 1 to H if X evaluates to true. The third branches to either label 200 or 300, depending on the relationship between the value of X and zero. In each case, the scanner must proceed well beyond the IF before it can classify IF as a variable or a keyword.
The final complication begins on line 12. Taken by itself, the line appears to be an END statement, which usually appears at the end of a procedure. It is followed, on line 13, by a comment. The comment is trivially recognized by the C in column 1. However, line 14 is a continuation of the previous statement, on line 12. To see this, the scanner must read line 13, discover that it is a comment, and then read line 14 to discover the $ in column 6. At this point, it finds the string FILE. Since blanks (and intervening comment cards) are not significant, the word begun on line 12 is actually ENDFILE, split across an internal comment. Thus, lines 12 and 14 form an ENDFILE statement that marks the file designated as 1 as finished.
The last line, 15, is truly an END statement.
To scan this simple piece of Fortran text, the scanner needed to look arbitrarily far ahead in the text—limited only by the end of the statement. In the process, it applied idiosyncratic rules related to identifier length and to the placement of symbols like commas and equals signs. It had to read to the end of some statements to categorize the initial word of a line.
While these problems in Fortran are the result of language design from the late 1950s, more modern languages have their own occasional lexical lapses. For example, pl/i, designed a decade later, discarded the notion of reserved keywords. Thus, the programmer could use words like if, then, and while as variable names. Rampant and tasteless use of that “feature” led to several examples of lexical confusion.
if then then then = else; else else = then;
This code fragment is an if-then-else construct that controls assignments between two variables named then and else. The choice between the then-part and the else-part is based on an expression consisting of a single reference to the variable then. It is unclear why anyone would want to write this fragment.
More difficult, from a lexical perspective, is the following pair of statements.
declare (a1,a2,a3,a4) fixed binary;
declare (a1,a2,a3,a4) = 2;
The first declares four integer variables, named a1, a2, a3, and a4. The second is an assignment to an element of a four-dimensional array named declare.
(It presupposes the existence of a declaration for the array.) This example exhibits a Fortran-like problem. The compiler must scan to the = before discovering whether declare is a keyword or an identifier. Since pl/i places no limit on the length of the comma-separated list, the scanner must examine an arbitrary amount of right context before it can determine the syntactic category for declare. This complicates the problem of buffering the input in the scanner.
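The unbounded lookahead can be sketched in the same spirit. This hypothetical Python fragment scans past the parenthesized list, however long it is, before classifying the leading word:

```python
def classify_declare(stmt):
    """Decide whether a statement beginning with 'declare (' is a
    DECLARE statement or an assignment to an array named declare.

    Sketch only: scan to the parenthesis that matches the first '(',
    then check whether an '=' follows. The list inside the parentheses
    can be arbitrarily long, so the scanner must buffer all of it.
    """
    depth = 0
    for i, ch in enumerate(stmt):
        if ch == '(':
            depth += 1
        elif ch == ')':
            depth -= 1
            if depth == 0:
                rest = stmt[i + 1:].lstrip()
                return 'assignment' if rest.startswith('=') else 'declare'
    return 'declare'

print(classify_declare('declare (a1,a2,a3,a4) fixed binary;'))
print(classify_declare('declare (a1,a2,a3,a4) = 2;'))
```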
As a final example, consider the syntax of c++, a language designed in the late 1980s. The template syntax of c++ allows the fragment
PriorityQueue<MyType>
If MyType is itself a template, this can lead to the fragment
PriorityQueue<MyType<int>>
which seems straightforward to scan. Unfortunately, >> is a c++ operator for writing to the output stream, making this fragment mildly confusing. The c++ standard actually requires one or more blanks between two consecutive angle brackets that end a template definition. However, many c++ compilers recognize this detail as one that programmers will routinely overlook. Thus, they correctly handle the case of the missing blank. This confusion can be resolved in the parser by matching the angle brackets with the corresponding opening brackets. The scanner, of course, cannot match the brackets. Recognizing >> as either two closing occurrences of > or as a single operator requires some coordination between the scanner and the parser.
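To sketch the kind of coordination involved (an illustration, not the scheme any particular compiler uses), suppose the parser shares its count of open template brackets with the scanner; a pending count of two or more lets a >> token be split into two closing brackets:

```python
def retokenize(tokens):
    """Split a '>>' token into two '>' tokens when at least two
    template brackets are open.

    Sketch only: a real scanner and parser exchange this information
    incrementally, not over a finished token list.
    """
    out, open_angles = [], 0
    for t in tokens:
        if t == '<':
            open_angles += 1
            out.append(t)
        elif t == '>>' and open_angles >= 2:
            out.extend(['>', '>'])   # two closing template brackets
            open_angles -= 2
        elif t == '>':
            open_angles -= 1
            out.append(t)
        else:
            out.append(t)            # '>>' with no open brackets stays
    return out                       # a single stream operator

print(retokenize(['PriorityQueue', '<', 'MyType', '<', 'int', '>>']))
```

With no pending template brackets, a >> token passes through unchanged, so ordinary stream expressions are unaffected.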
2.11 Summary and Perspective
The widespread use of regular expressions for searching and scanning is one of the success stories of modern computer science. These ideas were developed as an early part of the theory of formal languages and automata. They are routinely applied in tools ranging from text editors to compilers as a means of concisely specifying groups of strings (that happen to be regular languages).
Most modern compilers use generated scanners. The properties of deterministic finite automata match quite closely the demands of a compiler. The cost of recognizing a word is proportional to its length. The overhead per character is quite small in a careful implementation. The number of states can be reduced with the widely-used minimization algorithm. Direct-encoding of the states provides a speed boost over a table-driven interpreter. The widely available scanner generators are good enough that hand-implementation can rarely, if ever, be justified.
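As a minimal illustration of the table-driven scheme, the following sketch recognizes words matching letter (letter | digit)∗ — a toy language chosen for brevity, not one of the chapter's examples. Each character costs one classification and one table lookup:

```python
def char_class(ch):
    """Map a character to one of a few classes so the transition
    table stays narrow."""
    if ch.isdigit():
        return 'digit'
    if ch.isalpha():
        return 'letter'
    return 'other'

# delta[state][class] -> next state; -1 is the rejecting error state
delta = {
    0: {'letter': 1, 'digit': -1, 'other': -1},
    1: {'letter': 1, 'digit': 1, 'other': -1},
}
accepting = {1}

def accepts(word):
    """Run the DFA over the word; cost is proportional to its length."""
    state = 0
    for ch in word:
        state = delta.get(state, {}).get(char_class(ch), -1)
        if state == -1:
            return False
    return state in accepting

print(accepts('x9'), accepts('9x'))
```

A direct-encoded scanner would replace the dictionary lookups with branches compiled into code, trading table space for speed, as noted above.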
Questions
1. Consider the following regular expression:
r0 | r00 | r1 | r01 | r2 | r02 | ... | r30 | r31
Apply the constructions to build
(a) the nfa from the re,
(b) the dfa from the nfa, and
(c) the re from the dfa.
Explain any differences between the original re and the re that you produced.
How does the dfa that you built compare with the dfa built in the chapter from the following re?

r ((0 | 1 | 2) (digit | ε) | (4 | 5 | 6 | 7 | 8 | 9) | (3 | 30 | 31))
Chapter 3
Parsing
3.1 Introduction
The parser’s task is to analyze the input program, as abstracted by the scanner, and determine whether or not the program constitutes a legal sentence in the source language. Like lexical analysis, syntax analysis has been studied extensively. As we shall see, results from the formal treatment of syntax analysis lead to the creation of efficient parsers for large families of languages.
Many techniques have been proposed for parsing. Many tools have been built that largely automate parser construction. In this chapter, we will examine two specific parsing techniques. Both techniques are capable of producing robust, efficient parsers for typical programming languages. Using the first method, called top-down, recursive-descent parsing, we will construct a hand-coded parser in a systematic way. Recursive-descent parsers are typically compact and efficient. The parsing algorithm used is easy to understand and implement. The second method, called bottom-up, lr(1) parsing, uses results from formal language theory to construct a parsing automaton. We will explore how tools can directly generate a parsing automaton and its implementation from a specification of the language’s syntax. Lr(1) parsers are efficient and general; the tools for building lr(1) parsers are widely available for little or no cost.
Many other techniques for building parsers have been explored in practice, in the research literature, and in other textbooks. These include bottom-up parsers like slr(1), lalr(1), and operator precedence, and automated top-down parsers like ll(1) parsers. If you need a detailed explanation of one of these techniques, we suggest that you consult the older textbooks listed in the chapter bibliography for an explanation of how those techniques differ from lr(1).
3.2 Expressing Syntax
A parser is, essentially, an engine for determining whether or not the input program is a valid sentence in the source language. To answer this question, we need both a formal mechanism for specifying the syntax of the input language,
and a systematic method of determining membership in the formally-specified language. This section describes one mechanism for expressing syntax: a simple variation on the Backus-Naur form for writing formal grammars. The remainder of the chapter discusses techniques for determining membership in the language described by a formal grammar.
3.2.1 Context-Free Grammars
The traditional notation for expressing syntax is a grammar —a collection of rules that define, mathematically, when a string of symbols is actually a sentence in the language.
Computer scientists usually describe the syntactic structure of a language using an abstraction called a context-free grammar (cfg). A cfg, G, is a set of rules that describe how to form sentences; the collection of sentences that can be derived from G is called the language defined by G, and denoted L(G). An example may help. Consider the following grammar, which we call SN :
SheepNoise → SheepNoise baa
           |  baa
The first rule reads “SheepNoise can derive the string SheepNoise baa,” where SheepNoise is a syntactic variable and baa is a word in the language described by the grammar. The second rule reads “SheepNoise can also derive the string baa.”
To understand the relationship between the SN grammar and L(SN ), we need to specify how to apply the rules in the grammar to derive sentences in L(SN ). To begin, we must identify the goal symbol or start symbol of SN . The goal symbol represents the set of all strings in L(SN ). As such, it cannot be one of the words in the language. Instead, it must be one of the syntactic variables introduced to add structure and abstraction to the language. Since SN has only one syntactic variable, SheepNoise must be the goal symbol.
To derive a sentence, we begin with the string consisting of just the goal symbol. Next, we pick a syntactic variable, α, in the string and a rule α → β that has α on its left-hand side. We rewrite the string by replacing the selected occurrence of α with the right-hand side of the rule, β. We repeat this process until the string contains no more syntactic variables; at this point, the string consists entirely of words in the language, or terminal symbols.
At each point in this derivation process, the string is a collection of symbols drawn from the union of the set of syntactic variables and the set of words in the language. A string of syntactic variables and words is considered a sentential form if some valid sentence can be derived from it—that is, if it occurs in some step of a valid derivation. If we begin with SheepNoise and apply successive rewrites using the two rules, at each step in the process the string will be a sentential form. When we have reached the point where the string contains only words in the language (and no syntactic variables), the string is a sentence in L(SN ).
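The rewriting process just described is easy to animate. The Python sketch below (our illustration, not the book's) applies a numbered sequence of SN rules, replacing an occurrence of the left-hand side with the right-hand side at each step:

```python
# Rules of SN, numbered as in the text.
RULES = {
    1: ('SheepNoise', ['SheepNoise', 'baa']),
    2: ('SheepNoise', ['baa']),
}

def derive(rule_sequence):
    """Start from the goal symbol and apply the given rules in order,
    returning the final string of words."""
    string = ['SheepNoise']
    for r in rule_sequence:
        lhs, rhs = RULES[r]
        i = string.index(lhs)    # pick an occurrence of the variable
        string[i:i + 1] = rhs    # rewrite it with the right-hand side
    return ' '.join(string)

print(derive([2]))       # the one-step derivation of "baa"
print(derive([1, 2]))    # the two-step derivation of "baa baa"
```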

For SN , we must begin with the string “SheepNoise.” Using rule two, we can rewrite SheepNoise as baa. Since the sentential form contains only terminal symbols, no further rewrites are possible. Thus, the sentential form “baa” is a valid sentence in the language defined by our grammar. We can represent this derivation in tabular form.
Rule   Sentential Form
       SheepNoise
  2    baa
We could also begin with SheepNoise and apply rule one to obtain the sentential form “SheepNoise baa”. Next, we can use rule two to derive the sentence “baa baa”.
Rule   Sentential Form
       SheepNoise
  1    SheepNoise baa
  2    baa baa
As a notational convenience, we will build on this interpretation of the symbol →; when convenient, we will write →+ to mean “derives in one or more steps.” Thus, we might write SheepNoise →+ baa baa.
Of course, we can apply rule one in place of rule two to generate an even longer string of baas. Repeated application of this pattern, in the sequence (rule one)∗ rule two, will derive the language consisting of one or more occurrences of the word baa. This corresponds to the set of noises that a sheep makes, under normal circumstances. These derivations all have the same form.
Rule   Sentential Form
       SheepNoise
  1    SheepNoise baa
  1    SheepNoise baa baa
       ... and so on ...
  1    SheepNoise baa ... baa
  2    baa baa ... baa
Notice that this language is equivalent to the re baa baa∗, or baa+.
More formally, a grammar G is a four-tuple, G = (T, NT, S, P ), where:

T is the set of terminal symbols, or words, in the language. Terminal symbols are the fundamental units of grammatical sentences. In a compiler, the terminal symbols correspond to words discovered in lexical analysis.

NT is the set of non-terminal symbols, or syntactic variables, that appear in the rules of the grammar. NT consists of all the symbols mentioned in the rules other than those in T. Non-terminal symbols are variables used to provide abstraction and structure in the set of rules.
S is a designated member of NT called the goal symbol or start symbol. Any derivation of a sentence in G must begin with S. Thus, the language derivable from G (denoted L(G)) consists of exactly the sentences that can be derived starting from S. In other words, S represents the set of valid sentences in L(G).
P is a set of productions or rewrite rules. Formally, P : NT → (T ∪ NT)∗. Notice that we have restricted the definition so that it allows only a single non-terminal on the left-hand side. This ensures that the grammar is context free.
The rules of P encode the syntactic structure of the grammar.
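For concreteness, the four-tuple for SN can be written out directly; the encoding below is our own illustrative choice of Python data structures, not the book's notation:

```python
# G = (T, NT, S, P) for the SheepNoise grammar SN.
T = {'baa'}                        # terminal symbols
NT = {'SheepNoise'}                # non-terminal symbols (syntactic variables)
S = 'SheepNoise'                   # goal (start) symbol
P = [                              # productions, as (lhs, rhs) pairs
    ('SheepNoise', ('SheepNoise', 'baa')),
    ('SheepNoise', ('baa',)),
]

# Properties required by the definition: S is a non-terminal, every
# left-hand side is a single non-terminal, and each right-hand side
# draws its symbols from T ∪ NT.
assert S in NT
assert all(lhs in NT for lhs, _ in P)
assert all(sym in T | NT for _, rhs in P for sym in rhs)
```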
Notice that we can derive NT, T, and P directly from the grammar rules. For the SN grammar, we can also discover S. In general, discovering the start symbol is harder. Consider, for example, the grammar:

Paren   → ( Bracket )
        |  ( )
Bracket → [ Paren ]
        |  [ ]
The grammar describes the set of sentences consisting of balanced pairs of alternating parentheses and square brackets. It is not clear, however, if the outermost pair should be parentheses or square brackets. Designating Paren as S forces outermost parentheses. Designating Bracket as S forces outermost square brackets. If the intent is that either can serve as the outermost pair of symbols, we need two additional productions:
Start → Paren
      |  Bracket
This grammar has a clear and unambiguous goal symbol, Start. Because Start does not appear in the right-hand side of any production, it must be the goal symbol. Some systems that manipulate grammars require that a grammar have a single Start symbol that appears in no production’s right-hand side. They use this property to simplify the process of discovering S. As our example shows, we can always create a unique start symbol by adding one more non-terminal and a few simple productions.
3.2.2 Constructing Sentences
To explore the power and complexity of context-free grammars, we need a more involved example than SN . Consider the following grammar:
1.  Expr → Expr Op Number
2.       |  Number
3.  Op   → +
4.       |  −
5.       |  ×
6.       |  ÷

Digression: Notation for Context-Free Grammars
The traditional notation used by computer scientists to represent a context-free grammar is called Backus-Naur form, or bnf. Bnf denoted non-terminals by wrapping them in angle brackets, like ⟨SheepNoise⟩. Terminal symbols were underlined. The symbol ::= meant “derives,” and the symbol | meant “also derives.” In bnf, our example grammar SN would be written:

⟨SheepNoise⟩ ::= ⟨SheepNoise⟩ baa
             |   baa
Bnf has its origins in the late 1950s and early 1960s. The syntactic conventions of angle brackets, underlining, ::= and | arose in response to the limited typographic options available to people writing language descriptions.
(For an extreme example, see David Gries’ book Compiler Construction for Digital Computers, which was printed entirely using one of the print trains available on a standard lineprinter.) Throughout this book, we use a slightly updated form of bnf. Non-terminals are written with slanted text. Terminals are written in the typewriter font (and underlined when doing so adds clarity). “Derives” is written with a rightward-pointing arrow.
We have also forsaken the use of * to represent multiply and / to represent divide. We opt for the standard algebraic symbols × and ÷, except in actual program text. The meaning should be clear to the reader.
It defines a set of expressions over Numbers and the four operators +, −, ×, and ÷. Using the grammar as a rewrite system, we can derive a large set of expressions. For example, applying rule 2 produces the trivial expression consisting solely of Number. Using the sequence 1, 3, 2 produces the expression
Number + Number.
Rule   Sentential Form
       Expr
  1    Expr Op Number
  3    Expr + Number
  2    Number + Number
Longer rewrite sequences produce more complex expressions. For example, 1, 5, 1, 3, 2 derives the sentence Number + Number × Number.
Rule   Sentential Form
       Expr
  1    Expr Op Number
  5    Expr × Number
  1    Expr Op Number × Number
  3    Expr + Number × Number
  2    Number + Number × Number
We can depict this derivation graphically.

Expr
├── Expr
│   ├── Expr
│   │   └── Number
│   ├── Op: +
│   └── Number
├── Op: ×
└── Number
This derivation tree, or syntax tree, represents each step in the derivation.
So far, our derivations have always expanded the rightmost non-terminal symbol remaining in the string. Other choices are possible; the obvious alternative is to select the leftmost non-terminal for expansion at each point. Using leftmost choices would produce a different derivation sequence for the same sentence. For Number + Number × Number, the leftmost derivation would be:
Rule   Sentential Form
       Expr
  1    Expr Op Number
  1    Expr Op Number Op Number
  2    Number Op Number Op Number
  3    Number + Number Op Number
  5    Number + Number × Number
This “leftmost” derivation uses the same set of rules as the “rightmost” derivation, but applies them in a different order. The corresponding derivation tree looks like:
Expr
├── Expr
│   ├── Expr
│   │   └── Number
│   ├── Op: +
│   └── Number
├── Op: ×
└── Number
It is identical to the derivation tree for the rightmost derivation! The tree represents all the rules applied in the derivation, but not their order of application.
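We can check this claim mechanically. The sketch below (our own illustration) applies the two rule sequences from the tables above, expanding either the leftmost or the rightmost occurrence of each rule's left-hand side, and confirms that both orders reach the same sentence:

```python
# Rules for the expression grammar, numbered as in the text; the
# letter 'x' stands in for the multiplication sign.
RULES = {
    1: ('Expr', ['Expr', 'Op', 'Number']),
    2: ('Expr', ['Number']),
    3: ('Op', ['+']),
    5: ('Op', ['x']),
}

def derive(rule_sequence, leftmost=True):
    """Apply the rules in order, expanding either the leftmost or the
    rightmost occurrence of each rule's left-hand side. (For these two
    sequences, that choice coincides with expanding the leftmost or
    rightmost non-terminal overall.)"""
    string = ['Expr']
    for r in rule_sequence:
        lhs, rhs = RULES[r]
        spots = [i for i, s in enumerate(string) if s == lhs]
        i = spots[0] if leftmost else spots[-1]
        string[i:i + 1] = rhs
    return ' '.join(string)

left = derive([1, 1, 2, 3, 5], leftmost=True)     # leftmost derivation
right = derive([1, 5, 1, 3, 2], leftmost=False)   # rightmost derivation
print(left == right)  # both yield: Number + Number x Number
```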
We would expect the rightmost (or leftmost) derivation for a given sentence to be unique. If multiple rightmost (or leftmost) derivations exist for some sentence, then, at some point in the derivation, multiple distinct expansions of the rightmost (leftmost) non-terminal lead to the same sentence. This would produce multiple derivations and, possibly, multiple syntax trees—in other words, the sentence would lack a unique derivation.
A grammar G is ambiguous if and only if there exists a sentence in L(G) that has multiple rightmost (or leftmost) derivations. In general, grammatical structure is related to the underlying meaning of the sentence. Ambiguity is often undesirable; if the compiler cannot be sure of the meaning of a sentence, it cannot translate it into a single definitive code sequence.