
2.7. FROM REGULAR EXPRESSION TO SCANNER

2.7.3 Some final points

Thus far, we have developed the mechanisms to construct a dfa implementation from a single regular expression. To be useful, a compiler’s scanner must recognize all the syntactic categories that appear in the grammar for the source language. What we need, then, is a recognizer that can handle all the res for the language’s micro-syntax. Given the res for the various syntactic categories, r1, r2, r3, . . . , rk, we can construct a single re for the entire collection by forming (r1 | r2 | r3 | . . . | rk).

If we run this re through the entire process, building nfas for the subexpressions, joining them with ε-transitions, coalescing states, constructing the dfa that simulates the nfa, and turning the dfa into executable code, we get a scanner that recognizes precisely one word. That is, when we invoke it on some input, it will run through the characters one at a time and accept the string if it is in a final state when it exhausts the input. Unfortunately, most real programs contain more than one word. We need to transform either the language or the recognizer.

At the language level, we can insist that each word end with some easily recognizable delimiter, like a blank or a tab. This is deceptively attractive. Taken literally, it would require delimiters surrounding commas, operators such as + and -, and parentheses.

At the recognizer level, we can transform the dfa slightly and change the notion of accepting a string. For each final state, qi, we (1) create a new state qj, (2) remove qi from F and add qj to F, and (3) make the error transition from qi go to qj. When the scanner reaches qi and cannot legally extend the current word, it will take the transition to qj, a final state. As a final issue, we must make each new final state stop the scanner, backspace the input by one character, and accept. With these modifications, the recognizer will discover the longest legal word that is a prefix of the input string.
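The backup-and-accept behavior just described can be sketched in a few lines. The Python fragment below is illustrative only: the DFA (for "r" followed by one or more digits) and all the names are invented for the example. It runs the DFA as far as it can, remembers the last position at which it sat in a final state, and backs up to that point.

```python
# Illustrative DFA: "r" followed by one or more digits.
DELTA = {
    ("s0", "r"): "s1",
    **{("s1", d): "s2" for d in "0123456789"},
    **{("s2", d): "s2" for d in "0123456789"},
}
FINAL = {"s2"}

def next_word(text, pos):
    """Return (lexeme, new_pos) for the longest match starting at pos."""
    state = "s0"
    last_accept = None          # position just past the last final state
    i = pos
    while i < len(text) and (state, text[i]) in DELTA:
        state = DELTA[(state, text[i])]
        i += 1
        if state in FINAL:
            last_accept = i     # remember where we could legally stop
    if last_accept is None:
        return None, pos        # no legal word begins here
    return text[pos:last_accept], last_accept   # back up to last accept
```

On the input "r17+", the scanner runs past "r17", fails to extend the word on "+", and backs up to accept "r17".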

What about words that match more than one pattern? Because the methods described in this chapter build from a base of non-determinism, we can union together these arbitrary res without worrying about conflicting rules. For example, the specification for an Algol identifier admits all of the reserved keywords of the language. The compiler writer has a choice on handling this situation. The scanner can recognize those keywords as identifiers and look up each identifier in a pre-computed table to discover keywords, or it can include a re for each keyword. This latter case introduces non-determinism; the transformations will handle it correctly. It also introduces a more subtle problem: the final nfa reaches two distinct final states, one recognizing the keyword and the other recognizing the identifier, and is expected to consistently choose the former. To achieve the desired behavior, scanner generators usually offer a mechanism for prioritizing res to resolve such conflicts.

Lex and its descendants prioritize patterns by the order in which they appear in the input file. Thus, placing keyword patterns before the identifier pattern would ensure the desired behavior. The implementation can ensure that the final states for patterns are numbered in an order that corresponds to this priority ordering. When the scanner reaches a state representing multiple final states, it uses the action associated with the lowest-numbered final state.

    P ← { F, (Q − F) }
    while (P is still changing)
        T ← ∅
        for each set s ∈ P
            for each α ∈ Σ
                partition s by α
                    into s1, s2, s3, . . . , sk
                T ← T ∪ { s1, s2, s3, . . . , sk }
        if T ≠ P then P ← T

Figure 2.9: DFA minimization algorithm
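The tie-breaking rule can be seen in a small sketch. The rule list below is hypothetical and uses Python's re module in place of a generated dfa; it classifies an already-delimited word. What matters is the disambiguation logic: the longest match wins, and among rules matching the same length, the earliest rule in the list wins, so keyword patterns placed before the identifier pattern take priority.

```python
import re

# Hypothetical rule list in lex style: earlier rules win ties, so the
# keyword patterns precede the catch-all identifier pattern.
RULES = [
    (re.compile(r"while"), "KEYWORD"),
    (re.compile(r"if"),    "KEYWORD"),
    (re.compile(r"[a-z][a-z0-9]*"), "IDENTIFIER"),
]

def classify(word):
    """Longest full match wins; among equal lengths, the first rule wins."""
    best_len, best_cat = -1, None
    for pattern, category in RULES:
        m = pattern.match(word)
        # keep this rule only if it matches the whole word and is longer
        # than anything seen so far (strict '>' implements rule priority)
        if m and m.end() == len(word) and m.end() > best_len:
            best_len, best_cat = m.end(), category
    return best_cat
```

Here "while" is classified as a keyword, while "whilex" falls through to the identifier rule because the keyword pattern does not cover the whole word.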

2.8 Better Implementations

A straightforward scanner generator would take as input a set of regular expressions, construct the nfa for each re, combine them using ε-transitions (using the pattern for a|b in Thompson’s construction), and perform the subset construction to create the corresponding dfa. To convert the dfa into an executable program, it would encode the transition function into a table indexed by current state and input character, and plug the table into a fairly standard skeleton scanner, like the one shown in Figure 2.3.
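The table-driven skeleton can be sketched compactly. The dfa below (which accepts unsigned integers) and all the names are invented for illustration; the point is the shape of the loop: classify the character, index the table, repeat.

```python
ERROR = -1
# TABLE[state][char_class]; character classes: 0 = digit, 1 = other
TABLE = [
    [1, ERROR],   # state 0: a digit moves to state 1
    [1, ERROR],   # state 1: further digits stay in state 1
]
FINAL_STATES = {1}

def char_class(ch):
    return 0 if ch.isdigit() else 1

def accepts(text):
    """Run the table-driven dfa over text; True iff it ends in a final state."""
    state = 0
    for ch in text:
        state = TABLE[state][char_class(ch)]
        if state == ERROR:
            return False
    return state in FINAL_STATES
```

Grouping characters into classes keeps the table narrow; a full scanner would also attach an action to each accepting state.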

While this path from a collection of regular expressions to a working scanner is a little long, each of the steps is well understood. This is a good example of the kind of tedious process that is well suited to automation by a computer. However, a number of refinements to the automatic construction process can improve the quality of the resulting scanner or speed up the construction.

2.8.1 DFA minimization

The nfa to dfa conversion can create a dfa with a large set of states. While this does not increase the number of instructions required to scan a given string, it does increase the memory requirements of the recognizer. On modern computers, the speed of memory accesses often governs the speed of computation. Smaller tables use less space on disk, in ram, and in the processor’s cache. Each of those can be an advantage.

To minimize the size of the dfa, D = (Q, Σ, δ, q0, F ), we need a technique for recognizing when two states are equivalent—that is, they produce the same behavior on any input string. Figure 2.9 shows an algorithm that partitions the states of a dfa into equivalence classes based on their behavior relative to an input string.


Because the algorithm must also preserve halting behavior, it cannot place a final state in the same class as a non-final state. Thus, the initial partitioning step divides Q into two equivalence classes, F and Q − F.

Each iteration of the while loop refines the current partition, P, by splitting apart sets in P based on their outbound transitions. Consider a set p = {qi, qj, qk} in the current partition. Assume that qi, qj, and qk all have transitions on some symbol α ∈ Σ, with qx = δ(qi, α), qy = δ(qj, α), and qz = δ(qk, α). If all of qx, qy, and qz are in the same set in the current partition, then qi, qj, and qk should remain in the same set in the new partition. If, on the other hand, qz is in a different set than qx and qy, then the algorithm splits p into p1 = {qi, qj} and p2 = {qk}, and puts both p1 and p2 into the new partition. This is the critical step in the algorithm.

When the algorithm halts, the final partition cannot be refined. Thus, for a set s ∈ P, the states in s cannot be distinguished by their behavior on an input string. From the partition, we can construct a new dfa by using a single state to represent each set of states in P, and adding the appropriate transitions between these new representative states. For each set s ∈ P, the transition out of s on some α ∈ Σ must go to a single set t in P; if this were not the case, the algorithm would have split s into two or more smaller sets.

To construct the new dfa, we simply create a state to represent each p ∈ P, and add the appropriate transitions. After that, we need to remove any states not reachable from the entry state, along with any state that has transitions back to itself on every α ∈ Σ. (Unless, of course, we want an explicit representation of the error state.) The resulting dfa is minimal; we leave the proof to the interested reader.

This algorithm is another example of a fixed-point computation. P is finite; at most, it can contain |Q| elements. The body of the while loop can only increase the size of P; it splits sets in P but never combines them. The worst-case behavior occurs when each state in Q has different behavior; in that case, the while loop halts when P has a unique set for each q ∈ Q. (This would occur if the algorithm were invoked on a minimal dfa.)
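The partition refinement of Figure 2.9 can be transcribed directly into Python. This is an illustrative sketch, not production code; all names are invented. Each state in a group is bucketed by the tuple of partition classes its transitions reach, which is exactly the "partition s by α" step.

```python
def minimize_classes(states, alphabet, delta, finals):
    """Return the equivalence classes of dfa states as frozensets."""
    partition = {frozenset(finals), frozenset(states - finals)}
    partition.discard(frozenset())        # in case F or Q - F is empty
    while True:
        refined = set()
        for group in partition:
            buckets = {}
            for q in group:
                # the class reached on each symbol determines q's bucket
                key = tuple(
                    next((g for g in partition if delta.get((q, a)) in g), None)
                    for a in alphabet
                )
                buckets.setdefault(key, set()).add(q)
            refined.update(frozenset(b) for b in buckets.values())
        if refined == partition:          # fixed point reached
            return partition
        partition = refined
```

For the three-state dfa with δ(0,a) = 1, δ(1,a) = 2, δ(2,a) = 2 and final states {1, 2}, states 1 and 2 behave identically and end up in one class, leaving two classes in all.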

2.8.2 Programming Tricks

Explicit State Manipulation Versus Table Lookup The example code in Figure 2.3 uses an explicit variable, state, to hold the current state of the dfa. The while loop tests char against eof, computes a new state, calls action to interpret it, advances the input stream, and branches back to the top of the loop. The implementation spends much of its time manipulating or testing the state, and we have not yet explicitly discussed the expense incurred in the array lookup that implements the transition table, or the logic required to support the switch statement (see Chapter 8).

We can avoid much of this overhead by encoding the state information implicitly in the program counter. In this model, each state checks the next character against its transitions, and branches directly to the next state. This creates a program with complex control flow; it resembles nothing as much as a jumbled heap of spaghetti. Figure 2.10 shows a version of the skeleton recognizer written in this style. It is both shorter and simpler than the table-driven version. It should be faster, because the overhead per state is lower than in the table-lookup version.

    s0: char ← next character;
        word ← char;
        if (char = ’r’)
            then goto s1;
            else goto se;
    s1: char ← next character;
        word ← word + char;
        if (’0’ ≤ char ≤ ’9’)
            then goto s2;
            else goto se;
    s2: char ← next character;
        if (char = eof)
            then report acceptance;
        else if (’0’ ≤ char ≤ ’9’)
            then word ← word + char; goto s2;
        else goto se;
    se: print error message;
        return failure;

Figure 2.10: A direct-coded recognizer for “r digit digit*”

Of course, this implementation paradigm violates many of the precepts of structured programming. In a small code, like the example, this style may be comprehensible. As the re specification becomes more complex and generates both more states and more transitions, the added complexity can make it quite difficult to follow. If the code is generated directly from a collection of res, using automatic tools, there is little reason for a human to directly read or debug the scanner code. The additional speed obtained from lower overhead and better memory locality [5] makes direct coding an attractive option.
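For comparison, here is a rough Python rendering of the direct-coded style for the same pattern, "r digit digit*". Python has no goto, so the chain of states becomes straight-line code with a loop standing in for the self-transition on s2; the function name is invented for the example.

```python
def recognize(text):
    """Direct-coded recognizer for 'r' digit digit*; returns the word or None."""
    i = 0
    # s0: expect 'r'
    if i >= len(text) or text[i] != "r":
        return None                      # goto se
    word, i = text[i], i + 1
    # s1: expect one digit
    if i >= len(text) or not text[i].isdigit():
        return None                      # goto se
    word, i = word + text[i], i + 1
    # s2: zero or more further digits, then end of input
    while i < len(text) and text[i].isdigit():
        word, i = word + text[i], i + 1  # goto s2
    return word if i == len(text) else None   # accept only at eof
```

The state labels survive only as comments; each test-and-branch plays the role of one row of the transition table.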

Hashing Keywords versus Directly Encoding Them The scanner writer must choose how to specify reserved keywords in the source programming language— words like for, while, if, then, and else. These words can be written as regular expressions in the scanner specification, or they can be folded into the set of identifiers and recognized using a table lookup in the actions associated with an identifier.

With a reasonably implemented hash table, the expected-case behavior of the two schemes should differ by a constant amount. The dfa requires time proportional to the length of the keyword, and the hash mechanism adds a constant time overhead after recognition.

From an implementation perspective, however, direct coding is simpler. It avoids the need for a separate hash table of reserved words, along with the cost of a hash lookup on every identifier. Direct coding increases the size of the dfa from which the scanner is built. This can make the scanner’s memory requirements larger and might require more code to select the transitions out of some states. (The actual impact of these effects undoubtedly depends on the behavior of the memory hierarchy.)

[5] Large tables may have rather poor locality.

On the other hand, using a reserved word table also requires both memory and code. With a reserved word table, the cost of recognizing every identifier increases.
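The reserved-word-table alternative fits in a few lines. The sketch below is illustrative: the scanner recognizes every keyword as an identifier, and the action run in the identifier's final state consults a hash table (a Python set here) to reclassify it.

```python
# Hypothetical reserved-word table; a Python set gives the hash lookup.
KEYWORDS = {"for", "while", "if", "then", "else"}

def identifier_action(lexeme):
    """Action run when the dfa accepts in the identifier state."""
    # one constant-time hash lookup per recognized identifier
    return ("KEYWORD", lexeme) if lexeme in KEYWORDS else ("ID", lexeme)
```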

Specifying Actions In building a scanner generator, the designer can allow actions on each transition in the dfa or only in the final states of the dfa. This choice has a strong impact on the efficiency of the resulting dfa. Consider, for example, a re that recognizes positive integers without superfluous leading zeros.

0 | (1 | 2 | · · · | 9) (0 | 1 | 2 | · · · | 9)*

A scanner generator that allows actions only in accepting states will force the user to rescan the string to compute its actual value. Thus, the scanner will step through each character of the already recognized word, performing some appropriate action to convert the text into a decimal value. Worse yet, if the system provides a built-in mechanism for the conversion, the programmer will likely use it, adding the overhead of a procedure call to this simple and frequently executed operation. (On Unix systems, many lex-generated scanners contain an action that invokes sscanf() to perform precisely this function.)

If, however, the scanner generator allows actions on each transition, the compiler writer can implement the ancient assembly-language trick for this conversion. On recognizing an initial digit, the accumulator is set to the value of the recognized digit. On each subsequent digit, the accumulator is multiplied by ten and the new digit added to it. This algorithm avoids touching the character twice; it produces the result quickly and inline using the well-known conversion algorithm; and it eliminates the string manipulation overhead implicit in the first solution. (The scanner likely copies characters from the input buffer into some result string on each transition in the first scenario.)
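The per-transition trick can be sketched as follows; scan_integer is a hypothetical name, and the multiply-and-add step stands in for the action attached to each digit transition.

```python
def scan_integer(text, pos):
    """Consume digits starting at pos; return (value, new_pos) or (None, pos)."""
    if pos >= len(text) or not text[pos].isdigit():
        return None, pos
    value = 0
    while pos < len(text) and text[pos].isdigit():
        # action attached to the digit transition: accumulate the value
        value = value * 10 + (ord(text[pos]) - ord("0"))
        pos += 1
    return value, pos
```

The lexeme's value is complete the moment the last digit transition is taken; no second pass over the characters is needed.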

In general, the scanner should avoid processing each character multiple times. The more freedom that it allows the compiler writer in the placement of actions, the simpler it becomes to implement effective and efficient algorithms that avoid copying characters around and examining them several times.

2.9 Related Results

Regular expressions and their corresponding automata have been studied for many years. This section explores several related issues. These results do not play a direct role in scanner construction; however, they may be of intellectual interest in the discussion of scanner construction.

2.9.1 Eliminating ε-moves

When we applied Thompson’s construction to the regular expression a(b|c)*, the resulting nfa had ten states and twelve transitions. All but three of the transitions are ε-moves. A typical compiler-construction student would produce a two-state dfa with three transitions and no ε-moves.

[Diagram: the two-state dfa for a(b|c)*; s0 has a transition on a to s1, and s1 has transitions back to itself on b and c.]

Eliminating ε-moves can both shrink and simplify an nfa. While it is not strictly necessary in the process of converting a set of res into a dfa, it can be helpful if humans are to examine the automata at any point in the process.

Some ε-moves can be easily eliminated. The nfa shown on the left can arise from Thompson’s construction. The source of the ε-move, state sf, was a final state for some subexpression of the re ending in α; the sink of the ε-move, state t0, was the initial state of another subexpression beginning with either γ, δ, or θ.

[Diagram: left, an nfa fragment in which state sf (entered on α and β) is joined by an ε-move to state t0 (left on γ and δ); right, the simplified nfa after coalescing sf and t0 into a single state.]

In this particular case, we can eliminate the ε-move by combining the two states, sf and t0, into a single state. To accomplish this, we need to make sf the source of each edge leaving t0, and sf the sink of any edge entering t0. This produces the simplified nfa shown on the right. Notice that coalescing can create a state with multiple transitions on the same symbol:

 

[Diagram: coalescing si and sj into a single state sij leaves two transitions on α, one to sk and one to sm.]

If sk and sm are distinct states, then both ⟨sij, sk, α⟩ and ⟨sij, sm, α⟩ should remain. If sk and sm are the same state, then a single transition will suffice. A more general version of this problem arises if sk and sm are distinct, but the sub-nfas that they begin recognize the same languages. The dfa minimization algorithm should eliminate this latter kind of duplication.

Some ε-moves can be eliminated by simply coalescing states. Obviously, it works when the ε-move is the only edge leaving its source state and the only edge entering its sink state. In general, however, the states connected by an ε-move cannot be directly coalesced. Consider the following modification of the earlier nfa, where we have added one additional edge: a transition from sf to itself on ϕ.

 

 

 

 

 

 

 

 

 

 

[Diagram: left, the earlier nfa fragment with an added ϕ-transition from sf to itself; right, the nfa that results from coalescing sf and t0, in which both ϕ and β label the self-loop on the combined state.]


    for each edge e ∈ E
        if e = ⟨qi, qj, ε⟩ then
            add e to WorkList
    while (WorkList ≠ ∅)
        remove e = ⟨qi, qj, α⟩ from WorkList
        if α = ε then
            for each ⟨qj, qk, β⟩ in E
                add ⟨qi, qk, β⟩ to E
                if β = ε then
                    add ⟨qi, qk, β⟩ to WorkList
            delete ⟨qi, qj, ε⟩ from E
            if qj ∈ F then add qi to F
    for each state qi ∈ N
        if qi has no entering edge then
            delete qi from N

Figure 2.11: Removing ε-transitions

The nfa on the right would result from combining the states. Where the original nfa accepted words that contained the substring ϕ*β*, the new nfa accepts words containing (ϕ | β)*. Coalescing the states changed the language!

Figure 2.11 shows an algorithm that eliminates ε-moves by duplicating transitions. The underlying idea is quite simple. If there exists a transition ⟨qi, qj, ε⟩, it copies each transition leaving qj so that an equivalent transition leaves qi, and then deletes ⟨qi, qj, ε⟩. This has the effect of eliminating paths of the form εα and replacing them with a direct transition on α.

To understand its behavior, let’s apply it to the nfa for a(b|c)* shown in Figure 2.6. The first step puts all of the ε-moves onto a worklist. Next, the algorithm iterates over the worklist, copying edges, deleting ε-moves, and updating the set of final states, F. Figure 2.12 summarizes the iterations. The left column shows the edge removed from the worklist; the center column shows the transitions added by copying; the right column shows any states added to F. To clarify the algorithm’s behavior, we have removed edges from the worklist in phases. The horizontal lines divide the table into phases. Thus, the first section, from ⟨1,8⟩ to ⟨7,9⟩, contains all the edges put on the worklist initially. The second section includes all edges added during the first phase. The final section includes all edges added during the second phase. Since it adds no additional ε-moves, the worklist is empty and the algorithm halts.

    ε-move from                                Add
    WorkList        Adds transitions           to F
    ---------------------------------------------------
    ⟨1,8⟩           ⟨1,9,ε⟩ ⟨1,6,ε⟩
    ⟨8,6⟩           ⟨8,2,ε⟩ ⟨8,4,ε⟩
    ⟨8,9⟩                                      8
    ⟨6,2⟩           ⟨6,3,b⟩
    ⟨6,4⟩           ⟨6,5,c⟩
    ⟨3,7⟩           ⟨3,6,ε⟩ ⟨3,9,ε⟩            3
    ⟨5,7⟩           ⟨5,6,ε⟩ ⟨5,9,ε⟩            5
    ⟨7,6⟩           ⟨7,2,ε⟩ ⟨7,4,ε⟩
    ⟨7,9⟩                                      7
    ---------------------------------------------------
    ⟨1,9⟩                                      1
    ⟨1,6⟩           ⟨1,3,b⟩ ⟨1,5,c⟩
    ⟨8,2⟩           ⟨8,3,b⟩
    ⟨8,4⟩           ⟨8,5,c⟩
    ⟨3,6⟩           ⟨3,2,ε⟩ ⟨3,4,ε⟩
    ⟨3,9⟩                                      3
    ⟨5,6⟩           ⟨5,2,ε⟩ ⟨5,4,ε⟩
    ⟨5,9⟩                                      5
    ⟨7,2⟩           ⟨7,3,b⟩
    ⟨7,4⟩           ⟨7,5,c⟩
    ---------------------------------------------------
    ⟨3,2⟩           ⟨3,3,b⟩
    ⟨3,4⟩           ⟨3,5,c⟩
    ⟨5,2⟩           ⟨5,3,b⟩
    ⟨5,4⟩           ⟨5,5,c⟩

Figure 2.12: ε-removal algorithm applied to a(b|c)*

[Diagram: the resulting nfa. State s0 has a transition on a to s1; each of s1, s3, and s5 has a transition on b to s3 and a transition on c to s5.]

The resulting nfa is much simpler than the original. It has four states and seven transitions, none of them on ε. Of course, it is still somewhat more complex than the two-state, three-transition nfa shown earlier. Applying the dfa minimization algorithm would simplify this automaton further.
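The worklist algorithm of Figure 2.11 can be transcribed into Python and checked against this example. The edge set below is reconstructed from the worked example in Figure 2.12 (the state numbers are the ones used there; treat the encoding as illustrative). The sketch enqueues each ε-edge at most once, which suffices here though ε-cycles would need more care, and it prunes by reachability from the start state, a slightly stronger filter than the book's no-entering-edge test.

```python
EPS = "eps"

# ε-edges and labeled edges of the Thompson-style nfa for a(b|c)*,
# reconstructed from Figure 2.12; state 0 is initial, state 9 is final.
NFA_EDGES = {
    (0, 1, "a"), (2, 3, "b"), (4, 5, "c"),
    (1, 8, EPS), (8, 6, EPS), (8, 9, EPS), (6, 2, EPS), (6, 4, EPS),
    (3, 7, EPS), (5, 7, EPS), (7, 6, EPS), (7, 9, EPS),
}

def remove_epsilon(edges, finals, start=0):
    """Worklist ε-removal in the spirit of Figure 2.11 (a sketch)."""
    edges, finals = set(edges), set(finals)
    worklist = [e for e in edges if e[2] == EPS]
    seen = set(worklist)              # enqueue each ε-edge at most once
    while worklist:
        qi, qj, _ = worklist.pop()
        for src, qk, beta in list(edges):
            if src != qj:
                continue
            new_edge = (qi, qk, beta)
            if beta != EPS:
                edges.add(new_edge)   # copy a labeled edge up to qi
            elif new_edge not in seen:
                edges.add(new_edge)   # copy an ε-edge and process it later
                seen.add(new_edge)
                worklist.append(new_edge)
        edges.discard((qi, qj, EPS))
        if qj in finals:
            finals.add(qi)
    # prune states unreachable from the start state
    reachable, frontier = {start}, [start]
    while frontier:
        s = frontier.pop()
        for src, dst, _ in edges:
            if src == s and dst not in reachable:
                reachable.add(dst)
                frontier.append(dst)
    edges = {e for e in edges if e[0] in reachable and e[1] in reachable}
    return edges, finals & reachable
```

Running it on this nfa yields the four-state, seven-transition automaton described above, with final states 1, 3, and 5.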

2.9.2 Building a RE from a DFA

In Section 2.7, we showed how to build a dfa from an arbitrary regular expression. This can be viewed as a constructive proof that dfas are at least as powerful as res. In this section, we present a simple algorithm that constructs a re to describe the set of strings accepted by an arbitrary dfa. It shows that res are at least as powerful as dfas. Taken together, these constructions form the basis of a proof that res are equivalent to dfas.

    for i = 1 to N
        for j = 1 to N
            R⁰ᵢⱼ = { a | δ(sᵢ, a) = sⱼ }
            if (i = j) then
                R⁰ᵢⱼ = R⁰ᵢⱼ ∪ { ε }
    for k = 1 to N
        for i = 1 to N
            for j = 1 to N
                Rᵏᵢⱼ = Rᵏ⁻¹ᵢₖ (Rᵏ⁻¹ₖₖ)* Rᵏ⁻¹ₖⱼ ∪ Rᵏ⁻¹ᵢⱼ
    L = ∪ over sⱼ ∈ F of Rᴺ₁ⱼ

Figure 2.13: From a dfa to a re

Consider the diagram of a dfa as a graph with labeled edges. The problem of deriving a re that describes the language accepted by the dfa corresponds to a path problem over the dfa’s transition diagram. The set of strings in L(dfa) consists of the set of edge labels for every path from q0 to qi, for qi ∈ F. For any dfa with a cyclic transition graph, the set of such paths is infinite. Fortunately, res have the Kleene-closure operator to handle this case and summarize the complete set of sub-paths created by a cycle.

Several techniques can be used to compute this path expression. The algorithm generates an expression that represents the labels along all paths between two nodes, for each pair of nodes in the transition diagram. Then, it unions together the expressions for paths from q0 to qi, for qi ∈ F. This algorithm, shown in Figure 2.13, systematically constructs the path expressions for all paths. Assume, without loss of generality, that we can number the nodes from 1 to N, with q0 having the number 1.

The algorithm computes a set of expressions, denoted Rᵏᵢⱼ, for all the relevant values of i, j, and k. Rᵏᵢⱼ is an expression that describes all paths through the transition graph from state i to state j without going through a state numbered higher than k. Here, “through” means both entering and leaving, so that R²₁,₁₆ can be non-empty.

Initially, it sets R⁰ᵢⱼ to contain the labels of all edges that run directly from i to j. Over successive iterations, it builds up longer paths by adding to Rᵏ⁻¹ᵢⱼ the paths that actually pass through k on their way from i to j. Given Rᵏ⁻¹ᵢⱼ, the set of paths added by going from k − 1 to k is exactly the set of paths that run from i to k using no state higher than k − 1, concatenated with the paths from k to itself that pass through no state higher than k − 1, followed by the paths from k to j that pass through no state higher than k − 1. That is, each iteration of the loop on k adds the paths that pass through k to each set Rᵏ⁻¹ᵢⱼ.


     1. INTEGERFUNCTIONA
     2. PARAMETER(A=6,B=2)
     3. IMPLICIT CHARACTER*(A-B)(A-B)
     4. INTEGER FORMAT(10),IF(10),DO9E1
     5. 100 FORMAT(4H)=(3)
     6. 200 FORMAT(4 )=(3)
     7. DO9E1=1
     8. DO9E1=1,2
     9. 9 IF(X)=1
    10. IF(X)H=1
    11. IF(X)300,200
    12. 300 END
    13. C this is a comment
    14. $FILE(1)
    15. END

Figure 2.14: Scanning Fortran

When the k-loop terminates, the various Rᵏᵢⱼ expressions account for all paths in the transition graph. Now, we must compute the set of paths that begin in state 1 and end in some final state, sⱼ ∈ F.

2.10 Lexical Follies of Real Programming Languages

This chapter has dealt, largely, with the theory of specifying and automatically generating scanners. Most modern programming languages have a simple lexical structure. In fact, the development of a sound theoretical basis for scanning probably influenced language design in a positive way. Nonetheless, lexical difficulties do arise in the design of programming languages. This section presents several examples.

To see how difficult scanning can be, consider the example Fortran fragment shown in Figure 2.14. (The example is due to Dr. F.K. Zadeck.) In Fortran 66 (and Fortran 77), blanks are not significant; the scanner ignores them. Identifiers are limited to six characters, and the language relies on this property to make some constructs recognizable.

In line 1, we find a declaration of A as an integer function. To break this into words, the scanner must read INTEGE and notice that the next character, R, is neither an open parenthesis, as in INTEGE(10) = J, nor an assignment operator, as in INTEGE = J. This fact, combined with the six-character limit on identifiers, lets the scanner understand that INTEGE is the start of the reserved keyword INTEGER. The next reserved keyword, FUNCTION, requires application of the same six-character limit. After recognizing that FUNCTIONA has too many characters to be an INTEGER variable, the scanner can conclude that it has three words on the first line: INTEGER, FUNCTION, and A.

The second line declares A as a PARAMETER that is macro-expanded to 6 when
