
Cooper K., Engineering a Compiler
2.10 Lexical Follies of Real Programming Languages
Digression: The Hazards of Bad Lexical Design
An apocryphal story has long circulated in the compiler construction community. It suggests that an early NASA mission to Mars (or Venus, or the Moon) crashed because of a missing comma in the Fortran code for a do loop. Of course, the body of the loop would have executed once, rather than the intended number of times. While we doubt the truth of the story, it has achieved the status of an “urban legend” among compiler writers.
it occurs as a word on its own. Similarly, B expands to 2. Again, the scanner must rely on the six-character limit.
With the parameters expanded, line three scans as IMPLICIT CHARACTER*4 (A-B). It tells the compiler that any variable beginning with the letter A or B has the data type of a four-character string. Of course, the six-character limit makes it possible for the scanner to recognize IMPLICIT.
Line four has no new lexical complexity. It declares INTEGER arrays of ten elements named FORMAT and IF, and a scalar INTEGER variable named DO9E1.
Line five begins with a statement label, 100. Since Fortran was designed for punch cards, it has a fixed-field format for each line. Columns 1 through 5 are reserved for statement labels; a C in column 1 indicates that the entire line is a comment. Column 6 is empty, unless the line is a continuation of the previous line. The remainder of line five is a FORMAT statement. The notation 4H)=(3 is a “Hollerith constant.” 4H indicates that the following four characters form a literal constant. Thus, the entire line scans as:
⟨label, 100⟩ ⟨format keyword⟩ ⟨(⟩ ⟨constant, ")=(3"⟩ ⟨)⟩
This is a FORMAT statement, used to specify the way that characters are read or written in a READ, WRITE, or PRINT statement.
Line six is an assignment of the value 3 to the fourth element of the INTEGER array FORMAT, declared back on line 4. To distinguish between the variable and the keyword, the scanner must read past the (4) to reach the equals sign. Since the equals sign indicates an assignment statement, the text to its left must be a reference to a variable. Thus, FORMAT is the variable rather than the keyword. Of course, adding the H from line 5 would change that interpretation.
Line 7 assigns the value 1 to the variable DO9E1, while line 8 marks the beginning of a DO loop that ends at label 9 and uses the induction variable E1. The difference between these lines lies in the comma following =1. The scanner cannot decide whether DO9E1 is a variable or the sequence ⟨keyword, DO⟩ ⟨label, 9⟩ ⟨variable, E1⟩ until it reaches either the comma or the end of the line.
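The comma-driven decision can be sketched concretely. The following Python fragment (our illustration, not from the book) classifies a blank-stripped statement that begins with DO, under the simplifying assumption that a loop header has the shape DO label var = start, bound with no commas hidden inside parentheses:

```python
import re

def classify_do(stmt):
    """Classify a blank-stripped Fortran statement beginning with DO.

    Sketch only: assumes the simplified shape DOlabelvar=start,bound
    and ignores commas inside parenthesized subscripts.
    """
    m = re.match(r'DO(\d+)([A-Z][A-Z0-9]*)=(.*)$', stmt)
    if m and ',' in m.group(3):
        # a comma after the '=' marks a loop header: DO label var = ...
        return 'do-loop'
    # no comma: the whole prefix is one identifier being assigned
    return 'assignment'

print(classify_do('DO9E1=1'))    # line 7: an assignment to DO9E1
print(classify_do('DO9E1=1,8'))  # line 8: a DO loop, label 9, variable E1
```

Note that the decision hinges on text well to the right of the word being classified, just as described above.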
The next three lines look quite similar, but scan differently. The first is an assignment of the value 1 to the Xth element of the array IF declared on line 4. The second is an IF statement that assigns 1 to H if X evaluates to true. The third branches to either label 200 or 300, depending on the relationship between the value of X and zero. In each case, the scanner must proceed well beyond the IF before it can classify IF as a variable or a keyword.
The final complication begins on line 12. Taken by itself, the line appears to be an END statement, which usually appears at the end of a procedure. It is followed, on line 13, by a comment. The comment is trivially recognized by the C in column 1. However, line 14 is a continuation of the previous statement, on line 12. To see this, the scanner must read line 13, discover that it is a comment, and then read line 14 to discover the $ in column 6. At this point, it finds the string FILE. Since blanks (and intervening comment cards) are not significant, the word begun on line 12 is actually ENDFILE, split across an internal comment. Thus, lines 12 and 14 form an ENDFILE statement that marks the file designated as 1 as finished.
The last line, 15, is truly an END statement.
To scan this simple piece of Fortran text, the scanner needed to look arbitrarily far ahead in the text—limited only by the end of the statement. In the process, it applied idiosyncratic rules related to identifier length and to the placement of symbols like commas and equals signs. It had to read to the end of some statements to categorize the initial word of a line.
While these problems in Fortran are the result of language design from the late 1950s, more modern languages have their own occasional lexical lapses. For example, pl/i, designed a decade later, discarded the notion of reserved keywords. Thus, the programmer could use words like if, then, and while as variable names. Rampant and tasteless use of that “feature” led to several examples of lexical confusion.
if then then then = else; else else = then;
This code fragment is an if-then-else construct that controls assignments between two variables named then and else. The choice between the then-part and the else-part is based on an expression consisting of a single reference to the variable then. It is unclear why anyone would want to write this fragment.
More difficult, from a lexical perspective, is the following pair of statements.
declare (a1,a2,a3,a4) fixed binary;
declare (a1,a2,a3,a4) = 2;
The first declares four integer variables, named a1, a2, a3, and a4. The second is an assignment to an element of a four-dimensional array named declare.
(It presupposes the existence of a declaration for the array.) This example exhibits a Fortran-like problem. The compiler must scan to the = before discovering whether declare is a keyword or an identifier. Since pl/i places no limit on the length of the comma-separated list, the scanner must examine an arbitrary amount of right context before it can determine the syntactic category for declare. This complicates the problem of buffering the input in the scanner.
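The unbounded lookahead can be sketched in the same spirit. This hypothetical Python fragment scans past the parenthesized list, however long it is, before classifying the leading word:

```python
def classify_declare(stmt):
    """Decide whether a statement beginning with 'declare (' is a
    DECLARE statement or an assignment to an array named declare.

    Sketch only: scan to the parenthesis that matches the first '(',
    then check whether an '=' follows. The list inside the parentheses
    can be arbitrarily long, so the scanner must buffer all of it.
    """
    depth = 0
    for i, ch in enumerate(stmt):
        if ch == '(':
            depth += 1
        elif ch == ')':
            depth -= 1
            if depth == 0:
                rest = stmt[i + 1:].lstrip()
                return 'assignment' if rest.startswith('=') else 'declare'
    return 'declare'

print(classify_declare('declare (a1,a2,a3,a4) fixed binary;'))
print(classify_declare('declare (a1,a2,a3,a4) = 2;'))
```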
As a final example, consider the syntax of c++, a language designed in the late 1980s. The template syntax of c++ allows the fragment
PriorityQueue<MyType>
If MyType is itself a template, this can lead to the fragment
PriorityQueue<MyType<int>>
which seems straightforward to scan. Unfortunately, >> is a c++ operator for writing to the output stream, making this fragment mildly confusing. The c++ standard actually requires one or more blanks between two consecutive angle brackets that end a template definition. However, many c++ compilers recognize this detail as one that programmers will routinely overlook. Thus, they correctly handle the case of the missing blank. This confusion can be resolved in the parser by matching the angle brackets with the corresponding opening brackets. The scanner, of course, cannot match the brackets. Recognizing >> as either two closing occurrences of > or as a single operator requires some coordination between the scanner and the parser.
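To sketch the kind of coordination involved (an illustration, not the scheme any particular compiler uses), suppose the parser shares its count of open template brackets with the scanner; a pending count of two or more lets a >> token be split into two closing brackets:

```python
def retokenize(tokens):
    """Split a '>>' token into two '>' tokens when at least two
    template brackets are open.

    Sketch only: a real scanner and parser exchange this information
    incrementally, not over a finished token list.
    """
    out, open_angles = [], 0
    for t in tokens:
        if t == '<':
            open_angles += 1
            out.append(t)
        elif t == '>>' and open_angles >= 2:
            out.extend(['>', '>'])   # two closing template brackets
            open_angles -= 2
        elif t == '>':
            open_angles -= 1
            out.append(t)
        else:
            out.append(t)            # '>>' with no open brackets stays
    return out                       # a single stream operator

print(retokenize(['PriorityQueue', '<', 'MyType', '<', 'int', '>>']))
```

With no pending template brackets, a >> token passes through unchanged, so ordinary stream expressions are unaffected.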
2.11 Summary and Perspective
The widespread use of regular expressions for searching and scanning is one of the success stories of modern computer science. These ideas were developed as an early part of the theory of formal languages and automata. They are routinely applied in tools ranging from text editors to compilers as a means of concisely specifying groups of strings (that happen to be regular languages).
Most modern compilers use generated scanners. The properties of deterministic finite automata match quite closely the demands of a compiler. The cost of recognizing a word is proportional to its length. The overhead per character is quite small in a careful implementation. The number of states can be reduced with the widely-used minimization algorithm. Direct-encoding of the states provides a speed boost over a table-driven interpreter. The widely available scanner generators are good enough that hand-implementation can rarely, if ever, be justified.
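As a minimal illustration of the table-driven scheme, the following sketch recognizes words matching letter (letter | digit)∗ — a toy language chosen for brevity, not one of the chapter's examples. Each character costs one classification and one table lookup:

```python
def char_class(ch):
    """Map a character to one of a few classes so the transition
    table stays narrow."""
    if ch.isdigit():
        return 'digit'
    if ch.isalpha():
        return 'letter'
    return 'other'

# delta[state][class] -> next state; -1 is the rejecting error state
delta = {
    0: {'letter': 1, 'digit': -1, 'other': -1},
    1: {'letter': 1, 'digit': 1, 'other': -1},
}
accepting = {1}

def accepts(word):
    """Run the DFA over the word; cost is proportional to its length."""
    state = 0
    for ch in word:
        state = delta.get(state, {}).get(char_class(ch), -1)
        if state == -1:
            return False
    return state in accepting

print(accepts('x9'), accepts('9x'))
```

A direct-encoded scanner would replace the dictionary lookups with branches compiled into code, trading table space for speed, as noted above.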
Questions
1. Consider the following regular expression:
r0 | r00 | r1 | r01 | r2 | r02 | ... | r30 | r31
Apply the constructions to build
(a) the nfa from the re,
(b) the dfa from the nfa, and
(c) the re from the dfa.
Explain any differences between the original re and the re that you produced.
How does the dfa that you built compare with the dfa built in the chapter from the following re?

r ((0 | 1 | 2) (digit | ε) | (4 | 5 | 6 | 7 | 8 | 9) | (3 | 30 | 31))
Chapter 3
Parsing
3.1 Introduction
The parser’s task is to analyze the input program, as abstracted by the scanner, and determine whether or not the program constitutes a legal sentence in the source language. Like lexical analysis, syntax analysis has been studied extensively. As we shall see, results from the formal treatment of syntax analysis lead to the creation of efficient parsers for large families of languages.
Many techniques have been proposed for parsing. Many tools have been built that largely automate parser construction. In this chapter, we will examine two specific parsing techniques. Both techniques are capable of producing robust, efficient parsers for typical programming languages. Using the first method, called top-down, recursive-descent parsing, we will construct a hand-coded parser in a systematic way. Recursive-descent parsers are typically compact and efficient. The parsing algorithm used is easy to understand and implement. The second method, called bottom-up, lr(1) parsing, uses results from formal language theory to construct a parsing automaton. We will explore how tools can directly generate a parsing automaton and its implementation from a specification of the language’s syntax. Lr(1) parsers are efficient and general; the tools for building lr(1) parsers are widely available for little or no cost.
Many other techniques for building parsers have been explored in practice, in the research literature, and in other textbooks. These include bottom-up parsers like slr(1), lalr(1), and operator precedence, and automated top-down parsers like ll(1) parsers. If you need a detailed explanation of one of these techniques, we suggest that you consult the older textbooks listed in the chapter bibliography for an explanation of how those techniques differ from lr(1).
3.2 Expressing Syntax
A parser is, essentially, an engine for determining whether or not the input program is a valid sentence in the source language. To answer this question, we need both a formal mechanism for specifying the syntax of the input language,
and a systematic method of determining membership in the formally-specified language. This section describes one mechanism for expressing syntax: a simple variation on the Backus-Naur form for writing formal grammars. The remainder of the chapter discusses techniques for determining membership in the language described by a formal grammar.
3.2.1 Context-Free Grammars
The traditional notation for expressing syntax is a grammar —a collection of rules that define, mathematically, when a string of symbols is actually a sentence in the language.
Computer scientists usually describe the syntactic structure of a language using an abstraction called a context-free grammar (cfg). A cfg, G, is a set of rules that describe how to form sentences; the collection of sentences that can be derived from G is called the language defined by G, and denoted L(G). An example may help. Consider the following grammar, which we call SN :
SheepNoise → SheepNoise baa
           |  baa
The first rule reads “SheepNoise can derive the string SheepNoise baa,” where SheepNoise is a syntactic variable and baa is a word in the language described by the grammar. The second rule reads “SheepNoise can also derive the string baa.”
To understand the relationship between the SN grammar and L(SN ), we need to specify how to apply the rules in the grammar to derive sentences in L(SN ). To begin, we must identify the goal symbol or start symbol of SN . The goal symbol represents the set of all strings in L(SN ). As such, it cannot be one of the words in the language. Instead, it must be one of the syntactic variables introduced to add structure and abstraction to the language. Since SN has only one syntactic variable, SheepNoise must be the goal symbol.
To derive a sentence, we begin with the string consisting of just the goal symbol. Next, we pick a syntactic variable, α, in the string and a rule α → β that has α on its left-hand side. We rewrite the string by replacing the selected occurrence of α with the right-hand side of the rule, β. We repeat this process until the string contains no more syntactic variables; at this point, the string consists entirely of words in the language, or terminal symbols.
At each point in this derivation process, the string is a collection of symbols drawn from the union of the set of syntactic variables and the set of words in the language. A string of syntactic variables and words is considered a sentential form if some valid sentence can be derived from it—that is, if it occurs in some step of a valid derivation. If we begin with SheepNoise and apply successive rewrites using the two rules, at each step in the process the string will be a sentential form. When we have reached the point where the string contains only words in the language (and no syntactic variables), the string is a sentence in L(SN ).
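The rewriting process just described is easy to animate. The Python sketch below (our illustration, not the book's) applies a numbered sequence of SN rules, replacing an occurrence of the left-hand side with the right-hand side at each step:

```python
# Rules of SN, numbered as in the text.
RULES = {
    1: ('SheepNoise', ['SheepNoise', 'baa']),
    2: ('SheepNoise', ['baa']),
}

def derive(rule_sequence):
    """Start from the goal symbol and apply the given rules in order,
    returning the final string of words."""
    string = ['SheepNoise']
    for r in rule_sequence:
        lhs, rhs = RULES[r]
        i = string.index(lhs)    # pick an occurrence of the variable
        string[i:i + 1] = rhs    # rewrite it with the right-hand side
    return ' '.join(string)

print(derive([2]))       # the one-step derivation of "baa"
print(derive([1, 2]))    # the two-step derivation of "baa baa"
```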

For SN , we must begin with the string “SheepNoise.” Using rule two, we can rewrite SheepNoise as baa. Since the sentential form contains only terminal symbols, no further rewrites are possible. Thus, the sentential form “baa” is a valid sentence in the language defined by our grammar. We can represent this derivation in tabular form.
Rule   Sentential Form
       SheepNoise
  2    baa
We could also begin with SheepNoise and apply rule one to obtain the sentential form “SheepNoise baa”. Next, we can use rule two to derive the sentence “baa baa”.
Rule   Sentential Form
       SheepNoise
  1    SheepNoise baa
  2    baa baa
As a notational convenience, we will build on this interpretation of the symbol →; when convenient, we will write →+ to mean “derives in one or more steps.” Thus, we might write SheepNoise →+ baa baa.
Of course, we can apply rule one in place of rule two to generate an even longer string of baas. Repeated application of this pattern, in the sequence (rule one)∗ rule two, will derive the language consisting of one or more occurrences of the word baa. This corresponds to the set of noises that a sheep makes, under normal circumstances. These derivations all have the same form.
Rule   Sentential Form
       SheepNoise
  1    SheepNoise baa
  1    SheepNoise baa baa
       ... and so on ...
  1    SheepNoise baa ... baa
  2    baa baa ... baa
Notice that this language is equivalent to the re baa baa∗, or baa+.
More formally, a grammar G is a four-tuple, G = (T, NT, S, P ), where:

T is the set of terminal symbols, or words, in the language. Terminal symbols are the fundamental units of grammatical sentences. In a compiler, the terminal symbols correspond to words discovered in lexical analysis.

NT is the set of non-terminal symbols, or syntactic variables, that appear in the rules of the grammar. NT consists of all the symbols mentioned in the rules other than those in T. Non-terminal symbols are variables used to provide abstraction and structure in the set of rules.
S is a designated member of NT called the goal symbol or start symbol. Any derivation of a sentence in G must begin with S. Thus, the language derivable from G (denoted L(G)) consists of exactly the sentences that can be derived starting from S. In other words, S represents the set of valid sentences in L(G).
P is a set of productions or rewrite rules. Formally, P : NT → (T ∪ NT)∗. Notice that we have restricted the definition so that it allows only a single non-terminal on the left-hand side. This ensures that the grammar is context free.
The rules of P encode the syntactic structure of the grammar.
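For concreteness, the four-tuple for SN can be written out directly; the encoding below is our own illustrative choice of Python data structures, not the book's notation:

```python
# G = (T, NT, S, P) for the SheepNoise grammar SN.
T = {'baa'}                        # terminal symbols
NT = {'SheepNoise'}                # non-terminal symbols (syntactic variables)
S = 'SheepNoise'                   # goal (start) symbol
P = [                              # productions, as (lhs, rhs) pairs
    ('SheepNoise', ('SheepNoise', 'baa')),
    ('SheepNoise', ('baa',)),
]

# Properties required by the definition: S is a non-terminal, every
# left-hand side is a single non-terminal, and each right-hand side
# draws its symbols from T ∪ NT.
assert S in NT
assert all(lhs in NT for lhs, _ in P)
assert all(sym in T | NT for _, rhs in P for sym in rhs)
```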
Notice that we can derive NT, T, and P directly from the grammar rules. For the SN grammar, we can also discover S. In general, discovering the start symbol is harder. Consider, for example, the grammar:

Paren   → ( Bracket )
        |  ( )
Bracket → [ Paren ]
        |  [ ]
The grammar describes the set of sentences consisting of balanced pairs of alternating parentheses and square brackets. It is not clear, however, if the outermost pair should be parentheses or square brackets. Designating Paren as S forces outermost parentheses. Designating Bracket as S forces outermost square brackets. If the intent is that either can serve as the outermost pair of symbols, we need two additional productions:
Start → Paren
      |  Bracket
This grammar has a clear and unambiguous goal symbol, Start. Because Start does not appear in the right-hand side of any production, it must be the goal symbol. Some systems that manipulate grammars require that a grammar have a single Start symbol that appears in no production’s right-hand side. They use this property to simplify the process of discovering S. As our example shows, we can always create a unique start symbol by adding one more non-terminal and a few simple productions.
3.2.2 Constructing Sentences
To explore the power and complexity of context-free grammars, we need a more involved example than SN . Consider the following grammar:
1.  Expr → Expr Op Number
2.       |  Number
3.  Op   → +
4.       |  −
5.       |  ×
6.       |  ÷

Digression: Notation for Context-Free Grammars
The traditional notation used by computer scientists to represent a context-free grammar is called Backus-Naur form, or bnf. Bnf denoted non-terminals by wrapping them in angle brackets, like ⟨SheepNoise⟩. Terminal symbols were underlined. The symbol ::= meant “derives,” and the symbol | meant “also derives.” In bnf, our example grammar SN would be written:

⟨SheepNoise⟩ ::= ⟨SheepNoise⟩ baa
             |   baa
Bnf has its origins in the late 1950s and early 1960s. The syntactic conventions of angle brackets, underlining, ::= and | arose in response to the limited typographic options available to people writing language descriptions.
(For an extreme example, see David Gries’ book Compiler Construction for Digital Computers, which was printed entirely using one of the print trains available on a standard lineprinter.) Throughout this book, we use a slightly updated form of bnf. Non-terminals are written with slanted text. Terminals are written in the typewriter font (and underlined when doing so adds clarity). “Derives” is written with a rightward-pointing arrow.
We have also forsaken the use of * to represent multiply and / to represent divide. We opt for the standard algebraic symbols × and ÷, except in actual program text. The meaning should be clear to the reader.
It defines a set of expressions over Numbers and the four operators +, −, ×, and ÷. Using the grammar as a rewrite system, we can derive a large set of expressions. For example, applying rule 2 produces the trivial expression consisting solely of Number. Using the sequence 1, 3, 2 produces the expression
Number + Number.
Rule   Sentential Form
       Expr
  1    Expr Op Number
  3    Expr + Number
  2    Number + Number
Longer rewrite sequences produce more complex expressions. For example, 1, 5, 1, 3, 2 derives the sentence Number + Number × Number.
Rule   Sentential Form
       Expr
  1    Expr Op Number
  5    Expr × Number
  1    Expr Op Number × Number
  3    Expr + Number × Number
  2    Number + Number × Number
We can depict this derivation graphically.

Expr
├── Expr
│   ├── Expr
│   │   └── Number
│   ├── Op: +
│   └── Number
├── Op: ×
└── Number
This derivation tree, or syntax tree, represents each step in the derivation.
So far, our derivations have always expanded the rightmost non-terminal symbol remaining in the string. Other choices are possible; the obvious alternative is to select the leftmost non-terminal for expansion at each point. Using leftmost choices would produce a different derivation sequence for the same sentence. For Number + Number × Number, the leftmost derivation would be:
Rule   Sentential Form
       Expr
  1    Expr Op Number
  1    Expr Op Number Op Number
  2    Number Op Number Op Number
  3    Number + Number Op Number
  5    Number + Number × Number
This “leftmost” derivation uses the same set of rules as the “rightmost” derivation, but applies them in a different order. The corresponding derivation tree looks like:
Expr
├── Expr
│   ├── Expr
│   │   └── Number
│   ├── Op: +
│   └── Number
├── Op: ×
└── Number
It is identical to the derivation tree for the rightmost derivation! The tree represents all the rules applied in the derivation, but not their order of application.
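We can check this claim mechanically. The sketch below (our own illustration) applies the two rule sequences from the tables above, expanding either the leftmost or the rightmost occurrence of each rule's left-hand side, and confirms that both orders reach the same sentence:

```python
# Rules for the expression grammar, numbered as in the text; the
# letter 'x' stands in for the multiplication sign.
RULES = {
    1: ('Expr', ['Expr', 'Op', 'Number']),
    2: ('Expr', ['Number']),
    3: ('Op', ['+']),
    5: ('Op', ['x']),
}

def derive(rule_sequence, leftmost=True):
    """Apply the rules in order, expanding either the leftmost or the
    rightmost occurrence of each rule's left-hand side. (For these two
    sequences, that choice coincides with expanding the leftmost or
    rightmost non-terminal overall.)"""
    string = ['Expr']
    for r in rule_sequence:
        lhs, rhs = RULES[r]
        spots = [i for i, s in enumerate(string) if s == lhs]
        i = spots[0] if leftmost else spots[-1]
        string[i:i + 1] = rhs
    return ' '.join(string)

left = derive([1, 1, 2, 3, 5], leftmost=True)     # leftmost derivation
right = derive([1, 5, 1, 3, 2], leftmost=False)   # rightmost derivation
print(left == right)  # both yield: Number + Number x Number
```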
We would expect the rightmost (or leftmost) derivation for a given sentence to be unique. If multiple rightmost (or leftmost) derivations exist for some sentence, then, at some point in the derivation, multiple distinct expansions of the rightmost (leftmost) non-terminal lead to the same sentence. This would produce multiple derivations and, possibly, multiple syntax trees—in other words, the sentence would lack a unique derivation.
A grammar G is ambiguous if and only if there exists a sentence in L(G) that has multiple rightmost (or leftmost) derivations. In general, grammatical structure is related to the underlying meaning of the sentence. Ambiguity is often undesirable; if the compiler cannot be sure of the meaning of a sentence, it cannot translate it into a single definitive code sequence.