
4.4. AD-HOC SYNTAX-DIRECTED TRANSLATION


Naming Values With this stack-based communication mechanism, the compiler writer needs a way to name the stack locations that correspond to symbols in the production's right-hand side. Yacc introduced a concise notation for this purpose. The symbol $$ refers to the result location for the current production. Thus, the assignment $$ = 17; would push the integer value seventeen as the result corresponding to the current reduction. For the right-hand side, the symbols $1, $2, . . . , $n refer to the locations for the first, second, and nth symbols in the right-hand side, respectively. These symbols translate directly into offsets from the top of the stack. For a right-hand side β, $1 becomes 3 × | β | slots below the top of the stack, while $4 becomes 3 × (| β | −4 + 1) slots from the top of the stack. (The factor of three arises because each stack entry holds a symbol, a state, and a result.) This simple, natural notation allows the action snippets to read and write the stack locations directly.

4.4.2 Back to the Example

To understand how ad-hoc syntax-directed translation works, consider rewriting the execution-time estimator using this approach. The primary drawback of the attribute grammar solution lies in the proliferation of rules to copy information around the tree. This creates many additional rules in the specification. It also creates many copies of the sets. Even a careful implementation that stores pointers to a single copy of each set instance must create new sets whenever an identifier’s name is added to the set.

To address these problems in an ad hoc syntax-directed translation scheme, the compiler writer can introduce a central repository for information about variables, as suggested earlier. For example, the compiler can create a hash table that contains a record for each Ident in the code. If the compiler writer sets aside a field in the table, named Loaded, then the entire copy problem can be avoided. When the table is initialized, each Loaded field is set to false. The code for the production Factor → Ident checks the Loaded field and adds the appropriate cost for the reference to Ident. The code would look something like:

i = hash(Ident);
if (Table[i].Loaded = false) then
    cost = cost + Cost(load);
    Table[i].Loaded = true;

Because the compiler writer can use arbitrary constructs, the cost can be accumulated into a single variable, rather than being passed around the parse tree. The resulting set of actions is somewhat smaller than the attribution rules for the simplest execution model, even though it can provide the accuracy of the more complex model. Figure 4.6 shows the full code for an ad hoc version of the example shown in Figures 4.4 and 4.5.

In the ad hoc version, several productions have no action. The remaining actions are quite simple, except for the action taken on reduction by Ident. All of the complication introduced by tracking loads falls into that single action;

120

 

CHAPTER 4. CONTEXT-SENSITIVE ANALYSIS

 

Production                        Syntax-directed Actions

Block0   → Block1 Assign
         |  Assign                { cost = 0; }

Assign   → Ident = Expr ;         { cost = cost + Cost(store); }

Expr0    → Expr1 + Term           { cost = cost + Cost(add); }
         |  Expr1 − Term          { cost = cost + Cost(sub); }
         |  Term

Term0    → Term1 × Factor         { cost = cost + Cost(mult); }
         |  Term1 ÷ Factor        { cost = cost + Cost(div); }
         |  Factor

Factor   → ( Expr )
         |  Number                { cost = cost + Cost(loadI); }
         |  Ident                 { i = hash(Ident);
                                    if (Table[i].Loaded = false)
                                    then
                                      cost = cost + Cost(load);
                                      Table[i].Loaded = true; }

Figure 4.6: Tracking loads with ad hoc syntax-directed translation

contrast that with the attribute grammar version, where the task of passing around the Before and After sets came to dominate the specification. Because it can accumulate cost into a single variable and use a hash table to store global information, the ad hoc version is much cleaner and simpler. Of course, these same strategies could be applied in an attribute grammar framework, but doing so violates the spirit of the attribute grammar paradigm and forces all of the work outside the framework into an ad hoc setting.

4.4.3 Uses for Syntax-directed Translation

Compilers use syntax-directed translation schemes to perform many different tasks. By associating syntax-directed actions with the productions for source-language declarations, the compiler can accumulate information about the variables in a program and record it in a central repository, often called a symbol table. As part of the translation process, the compiler must discover the answers to many context-sensitive questions, similar to those mentioned in Section 4.1. Many, if not all, of these questions can be answered by placing the appropriate code in syntax-directed actions. The parser can build an ir for use by


the rest of the compiler by systematic use of syntax-directed actions. Finally, syntax-directed translation can be used to perform complex analysis and transformations. To understand the varied uses of syntax-directed actions, we will examine several different applications of ad hoc syntax-directed translation.

Building a Symbol Table To centralize information about variables, labels, and procedures, most compilers construct a symbol table for the input program. The compiler writer can use syntax-directed actions to gather the information, insert that information into the symbol table, and perform any necessary processing. For example, the grammar fragment shown in Figure 4.7 describes a subset of the syntax for declaring variables in c. (It omits typedefs, structs, unions, the type qualifiers const and volatile, and the initialization syntax; it also leaves several non-terminals unelaborated.) Consider the actions required to build symbol table entries for each declared variable.

Each Declaration begins with a set of one or more qualifiers that specify either the variable’s type, its storage class, or both. The qualifiers are followed by a list of one or more variable names; the variable name can include a specification about indirection (one or more occurrences of *), about array dimensions, and about initial values for the variable. To build symbol table entries for each variable, the compiler writer can gather up the attributes from the qualifiers, add any indirection, dimension, or initialization attributes, and enter the variable in the table.

For example, to track storage classes, the parser might include a variable StorageClass, initializing it to the value none. Each production that reduces to StorageClass would set StorageClass to an appropriate value. The language definition allows either zero or one storage class specifier per declaration, even though the context-free grammar admits an arbitrary number. The syntax-directed action can easily check this condition. Thus, the following code might execute on a reduction by StorageClass → auto:

if (StorageClass = none)
then StorageClass = auto
else report the error

Similar actions set StorageClass for the reductions by static, extern, and register. In the reduction for DirectDeclarator → Identifier, the action creates a new symbol table entry and uses StorageClass to set the appropriate field in that entry. To complete the process, the action for the production Declaration → SpecifierList InitDeclaratorList ; needs to reset StorageClass to none.

Following these lines, the compiler writer can arrange to record all of the attributes for each variable. In the reduction to Identifier, this information can be written into the symbol table. Care must be taken to initialize and reset the attributes in the appropriate place—for example, the attribute set by the Pointer reduction should be reset for each InitDeclarator. The action routines can check for valid and invalid TypeSpecifier combinations, such as signed char (legal) and double char (illegal).


DeclarationList    → DeclarationList Declaration
                   |  Declaration

Declaration        → SpecifierList InitDeclaratorList ;

SpecifierList      → Specifier SpecifierList
                   |  Specifier

Specifier          → StorageClass
                   |  TypeSpecifier

StorageClass       → auto
                   |  static
                   |  extern
                   |  register

TypeSpecifier      → char
                   |  short
                   |  int
                   |  long
                   |  signed
                   |  unsigned
                   |  float
                   |  double

InitDeclaratorList → InitDeclaratorList , InitDeclarator
                   |  InitDeclarator

InitDeclarator     → Declarator = Initializer
                   |  Declarator

Declarator         → Pointer DirectDeclarator
                   |  DirectDeclarator

Pointer            → *
                   |  * Pointer

DirectDeclarator   → Identifier
                   |  ( Declarator )
                   |  DirectDeclarator ( )
                   |  DirectDeclarator ( ParameterTypeList )
                   |  DirectDeclarator ( IdentifierList )
                   |  DirectDeclarator [ ]
                   |  DirectDeclarator [ ConstantExpr ]

Figure 4.7: A subset of c’s declaration syntax


When the parser finishes building the DeclarationList, it has symbol table entries for each variable declared in the current scope. At that point, the compiler may need to perform some housekeeping chores, such as assigning storage locations to declared variables. This can be done in an action for the production that leads to the DeclarationList. (That production has DeclarationList on its right-hand side, but not on its left-hand side.)

Building a Parse Tree Another major task that parsers often perform is building an intermediate representation for use by the rest of the compiler. Building a parse tree through syntax-directed actions is easy. Each time the parser reduces a production A → βγδ, it should construct a node for A and make the nodes for β, γ, and δ children of A, in order. To accomplish this, it can push pointers to the appropriate nodes onto the stack.

In this scheme, each reduction creates a node to represent the non-terminal on its left-hand side. The node has a child for each grammar symbol on the right-hand side of the production. The syntax-directed action creates the new node, uses the pointers stored on the stack to connect the node to its children, and pushes the new node’s address as its result.

As an example, consider parsing x − 2 × y with the classic expression grammar. It produces the following parse tree:

Goal
└── Expr
    ├── Expr
    │   └── Term
    │       └── Factor
    │           └── Id
    ├── −
    └── Term
        ├── Term
        │   └── Factor
        │       └── Num
        ├── ×
        └── Factor
            └── Id

Of course, the parse tree is large relative to its information content. The compiler writer might, instead, opt for an abstract syntax tree that retains the essential elements of the parse tree, but gets rid of internal nodes that add nothing to our understanding of the underlying code (see § 6.3.2).

To build an abstract syntax tree, the parser follows the same general scheme as for a parse tree. However, it builds only the desired nodes. For a production like A → B, the action returns as its result the pointer that corresponds to B. This eliminates many of the interior nodes. To simplify the tree further, the compiler writer can build a single node for a construct such as the if-then-else, rather than individual nodes for the if, the then, and the else.

An ast for x − 2 × y is much smaller than the corresponding parse tree:

      −
     / \
   Id    ×
        / \
      Num   Id


Grammar                 Actions             Grammar                 Actions

List → List Elt         $$ = L($1,$2);      List → Elt List         $$ = L($1,$2);
     |  Elt             $$ = $1;                 |  Elt             $$ = $1;


            •                                    •
           / \                                  / \
          •   Elt5                          Elt1   •
         / \                                      / \
        •   Elt4                              Elt2   •
       / \                                          / \
      •   Elt3                                  Elt3   •
     / \                                              / \
  Elt1   Elt2                                     Elt4   Elt5

       Left Recursion                          Right Recursion

Figure 4.8: Recursion versus Associativity

Changing Associativity As we saw in Section 3.6.3, associativity can make a difference in numerical computation. Similarly, it can change the way that data structures are built. We can use syntax-directed actions to build representations that reflect a different associativity than the grammar would naturally produce.

In general, left-recursive grammars naturally produce left associativity, while right-recursive grammars naturally produce right associativity. To see this, consider the left-recursive and right-recursive list grammars, augmented with syntax-directed actions to build lists, shown at the top of Figure 4.8. Assume that L(x,y) is a constructor that returns a new node with x and y as its children. The lower part of the figure shows the result of applying the two translation schemes to an input consisting of five Elts.

The two trees are, in many ways, equivalent. An in-order traversal of both trees visits the leaf nodes in the same order. However, the tree produced from the left-recursive version is strangely counter-intuitive. If we add parentheses to reflect the tree structure, the left-recursive tree is ((((Elt1,Elt2),Elt3),Elt4),Elt5) while the right-recursive tree is (Elt1,(Elt2,(Elt3,(Elt4,Elt5)))). The ordering produced by left recursion corresponds to the classic left-to-right ordering for algebraic operators. The ordering produced by right recursion corresponds to the notion of a list as introduced in programming languages like Lisp and Scheme.

Sometimes, it is convenient to use different directions for recursion and associativity. To build the right-recursive tree from the left-recursive grammar, we could use a constructor that adds successive elements to the end of the list. A straightforward implementation of this idea would have to walk the list on each reduction, making the constructor itself take O(n²) time, where n is the length of the list. Another classic solution to this problem uses a list-header node that contains pointers to both the first and last nodes in the list. This introduces an extra node to the list. If the system constructs many short lists, the overhead


may be a problem.

A solution that we find particularly appealing is to use a list-header node during construction, and to discard it when the list is fully built. Rewriting the grammar to use an ε-production makes this particularly clean:

 

Grammar             Actions

List → ε            { $$ = MakeListHeader(); }
     |  List Elt    { $$ = AddToEnd($1,$2); }

Quux → List         { $$ = RemoveListHeader($1); }

A reduction with the ε-production creates the temporary list header node; with a shift-reduce parser, this reduction occurs first. On a reduction for List → List Elt, the action invokes a constructor that relies on the presence of the temporary header node. When List is reduced on the right-hand side of any other production, the corresponding action invokes a function that discards the temporary header and returns the first element of the list. This solution lets the parser reverse the associativity at the cost of a small constant overhead in both space and time. It requires one more reduction per list, for the ε-production.

Thompson’s Construction Syntax-directed actions can be used to perform more complex tasks. For example, Section 2.7.1 introduced Thompson’s construction for building a non-deterministic finite automaton from a regular expression. Essentially, the construction has a small “pattern” nfa for each of the four base cases in a regular expression: (1) recognizing a symbol, (2) concatenating two res, (3) alternation to choose one of two res, and (4) closure to specify zero or more occurrences of an re. The syntax of the regular expressions, themselves, might be specified with a grammar that resembles the expression grammar.

RegExpr     → RegExpr Alternation
            |  Alternation

Alternation → Alternation | Closure
            |  Closure

Closure     → Closure *
            |  ( RegExpr )
            |  symbol

The nfa construction algorithm fits naturally into a syntax-directed translation scheme based on this grammar. The actions follow Thompson’s construction. On a case-by-case basis, it takes the following actions:

1. single symbol: On a reduction of Closure → symbol, the action constructs a new nfa with two states and a transition from the start state to the final state on the recognized symbol.

2. closure: On a reduction of Closure → Closure *, the action adds an ε-move to the nfa for Closure, from each of its final states to its initial state.


RegExpr     → RegExpr Alternation     { add an ε-move from the final state of $1
                                        to the start state of $2;
                                        $$ = the resulting nfa }
            |  Alternation            { $$ = $1 }

Alternation → Alternation | Closure   { create new states si and sj;
                                        add an edge from si to the start states of $1 and $3;
                                        add an edge from each final state of $1 and $3 to sj;
                                        designate sj as the sole final state of the new nfa;
                                        $$ = the resulting nfa }
            |  Closure                { $$ = $1 }

Closure     → Closure *               { add an ε-move from the final states of the
                                        nfa for closure to its start state;
                                        $$ = $1 }
            |  ( RegExpr )            { $$ = $2 }
            |  symbol                 { create a two-state nfa for symbol;
                                        $$ = the new nfa }

Figure 4.9: Syntax-directed version of Thompson’s construction

3. alternation: On a reduction of Alternation → Alternation | Closure, the action creates new start and final states, si and sj respectively. It adds an ε-move from si to the start states of the nfas for Alternation and Closure. It adds an ε-move from each final state of those nfas to sj. All states other than sj are marked as non-final states.

4. concatenation: On a reduction of RegExpr → RegExpr Alternation, the action adds an ε-move from each final state of the nfa for the first RegExpr to the start state of the nfa for the Alternation. It also changes the final states of the first nfa into non-final states.

The remaining productions pass the nfa that results from higher-precedence sub-expressions upwards in the parse. In a shift-reduce parser, some representation of the nfa can be pushed onto the stack. Figure 4.9 shows how this would be accomplished in an ad hoc syntax-directed translation scheme.


Digression: What about Context-Sensitive Grammars?

Given the progression of ideas from the previous chapters, it might seem natural to consider the use of context-sensitive languages (csls) to address these issues. After all, we used regular languages to perform lexical analysis, and context-free languages to perform syntax analysis. A natural progression might suggest the study of csls and their grammars. Context-sensitive grammars (csgs) can express a larger family of languages than can cfgs.

However, csgs are not the right answer for two distinct reasons. First, the problem of parsing a csg is p-space complete. Thus, a compiler that used csg-based techniques could run quite slowly. Second, many of the important questions are difficult to encode in a csg. For example, consider the issue of declaration before use. To write this rule into a csg would require distinct productions for each combination of declared variables. With a sufficiently small name space, this might be manageable; in a modern language with a large name space, the set of names is too large to encode into a csg.

4.5 What Questions Should the Compiler Ask?

Recall that the goal of context-sensitive analysis is to prepare the compiler for the optimization and code generation tasks that will follow. The questions that arise in context-sensitive analysis, then, divide into two broad categories: checking program legality at a deeper level than is possible with the cfg, and elaborating the compiler’s knowledge base from non-local context to prepare for optimization and code generation.

Along those lines, many context-sensitive questions arise.

Given a variable a, is it a scalar, an array, a structure, or a function? Is it declared? In which procedure is it declared? What is its datatype? Is it actually assigned a value before it is used?

For an array reference b[i,j,k], is b declared as an array? How many dimensions does b have? Are i, j, and k declared with a data type that is valid for an array index expression? Do i, j, and k have values that place b[i,j,k] inside the declared bounds of b?

Where can a and b be stored? How long must their values be preserved? Can either be kept in a hardware register?

In the reference *c, is c declared as a pointer? Is *c an object of the appropriate type? Can *c be reached via another name?

How many arguments does the function fee take? Do all of its invocations pass the right number of arguments? Are those arguments of the correct type?

Does the function fie() return a known constant value? Is it a pure function of its parameters, or does the result depend on some implicit state (i.e., the values of static or global variables)?


Most of these questions share a common trait. Their answers involve information that is not locally available in the syntax. For example, checking the number and type of arguments at a procedure call requires knowledge of both the procedure’s declaration and the call site in question. In many cases, these two statements will be separated by intervening context.

4.6 Summary and Perspective

In Chapters 2 and 3, we saw that much of the work in a compiler’s front end can be automated. Regular expressions work well for lexical analysis. Context-free grammars work well for syntax analysis. In this chapter, we examined two ways of performing context-sensitive analysis.

The first technique, using attribute grammars, offers the hope of writing high-level specifications that produce reasonably efficient executables. Attribute grammars have found successful application in several domains, ranging from theorem provers through program analysis. (See Section 9.6 for an application in compilation where attribute grammars may be a better fit.) Unfortunately, the attribute grammar approach has enough practical problems that it has not been widely accepted as the paradigm of choice for context-sensitive analysis.

The second technique, called ad hoc syntax-directed translation, integrates arbitrary snippets of code into the parser and lets the parser provide sequencing and communication mechanisms. This approach has been widely embraced, because of its flexibility and its inclusion in most parser generator systems.

Questions

1. Sometimes, the compiler writer can move an issue across the boundary between context-free and context-sensitive analysis. For example, we have discussed the classic ambiguity that arises between function invocation and array references in Fortran 77 (and other languages). These constructs might be added to the classic expression grammar using the productions:

factor   → ident ( ExprList )

ExprList → expr
         |  ExprList , expr

Unfortunately, the only difference between a function invocation and an array reference lies in how the ident is declared.

In previous chapters, we have discussed using cooperation between the scanner and the parser to disambiguate these constructs. Can the problem be solved during context-sensitive analysis? Which solution is preferable?

2. Sometimes, a language specification uses context-sensitive mechanisms to check properties that can be tested in a context-free way. Consider the grammar fragment in Figure 4.7. It allows an arbitrary number of StorageClass specifiers when, in fact, the standard restricts a declaration to a single StorageClass specifier.
