
10.6. HARDER PROBLEMS

[Figure 10.11 appears here in the original: a block diagram of a clustered register-set machine. The register files r0–r63 and r64–r127 are split into per-cluster partitions, the functional units fu0–fu15 are grouped in clusters around those partitions, and an interconnect carries the inter-partition value transfers.]

Figure 10.11: A clustered register-set machine

dures, the allocator must deal with parameter binding mechanisms and with the side effects of linkage conventions. Each of these creates new situations for the allocator. Call-by-reference parameter binding can link otherwise independent live ranges in distinct procedures together into a single interprocedural live range. Furthermore, they can introduce ambiguity into the memory model, by creating multiple names that can access a single memory location. (See the discussion of “aliasing” in Chapter 13.) Register save/restore conventions introduce mandatory spills that might, for a single intraprocedural live range, involve multiple memory locations.

10.6.2 Partitioned Register Sets

New complications can arise from new hardware features. For example, consider the non-uniform costs that arise on machines with partitioned register sets. As the number of functional units rises, the number of registers required to hold operands and results rises. Limitations arise in the hardware logic required to move values between registers and functional units. To keep the hardware costs manageable, some architects have partitioned the register set into smaller register files and clustered functional units around these partitions. To retain generality, the processor typically provides some limited mechanism for moving values between clusters. Figure 10.11 shows a highly abstracted view of such a processor. Assume, without loss of generality, that each cluster has an identical set of functional units and registers, except for their unique names.

Machines with clustered register sets layer a new set of complications onto the register assignment problem. In deciding where to place lri in the register set, the allocator must understand the availability of registers in each cluster, the cost and local availability of the inter-cluster transfer mechanism, and the specific functional units that will execute the operations that reference lri. For example, the processor might allow each cluster to generate one off-cluster register reference per cycle, with a limit of one off-cluster transfer out of each cluster each cycle. With this constraint, the allocator must pay attention to the placement of operations in clusters and in time. (Clearly, this requires attention from both the scheduler and the register allocator.) Another processor might require the code to execute a register-to-register move instruction, with limitations on the number of values moving in and out of each cluster in a single cycle. Under this constraint, cross-cluster use of a value requires extra instructions; if done on the critical path, this can lengthen overall execution time.
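The cost the allocator must weigh here can be made concrete with a toy model. The sketch below is illustrative only; the machine parameters, names, and operation encoding are invented rather than taken from the text. Given a placement of virtual registers in clusters, it counts the off-cluster register references a code sequence would incur.

```python
# Toy cost model for a clustered register-set machine (hypothetical
# parameters, not the allocator from the text).  Each operation records
# the cluster of the functional unit that executes it and the virtual
# registers it reads; reading a register placed in another cluster
# costs one inter-cluster transfer.

def cross_cluster_transfers(ops, placement):
    """ops: list of (cluster, registers_read); placement: reg -> cluster."""
    transfers = 0
    for cluster, uses in ops:
        for reg in uses:
            if placement[reg] != cluster:
                transfers += 1
    return transfers

ops = [(0, ["r1", "r2"]),   # executes on cluster 0
       (1, ["r1", "r3"]),   # executes on cluster 1
       (1, ["r3"])]         # executes on cluster 1

# Placing r1 in cluster 0 leaves one off-cluster read (r1 from cluster 1).
print(cross_cluster_transfers(ops, {"r1": 0, "r2": 0, "r3": 1}))  # 1
```

An allocator could use such a count, weighted by the transfer cost, when comparing candidate placements for a live range.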

10.6.3 Ambiguous Values

A final set of complications to register allocation arises from shortcomings in the compiler’s knowledge about the runtime behavior of the program. Many source-language constructs create ambiguous references (see Section 8.2), including array-element references, pointer-based references, and some call-by-reference parameters. When the compiler cannot determine that a reference is unambiguous, it cannot keep the value in a register. If the value is heavily used, this shortcoming in the compiler’s analytical abilities can lead directly to poor run-time performance.

For some codes, heavy use of ambiguous values is a serious performance issue. When this occurs, the compiler may find it profitable to perform more detailed and precise analysis to remove ambiguity. The compiler might perform interprocedural data-flow analysis that reasons about the set of values that might be reachable from a specific pointer. It might rely on careful analysis in the front-end to recognize unambiguous cases and encode them appropriately in the il. It might rely on transformations that rewrite the code to simplify analysis.

To improve allocation of ambiguous values, several systems have included transformations that rewrite the code to keep unambiguous values in scalar local variables, even when their “natural” home is inside an array element or a pointer-based structure. Scalar replacement uses array-subscript analysis to identify reuse of array element values and to introduce scalar temporary variables that hold reused values. Register promotion uses data-flow analysis on pointer values to determine when a pointer-based value can be kept safely in a register throughout a loop nest, and to rewrite the code so that the value is kept in a newly introduced temporary variable. Both of these transformations move functionality from the register allocator into earlier phases of the compiler. Because they increase the demand for registers, they increase the cost of a misstep during allocation, and, conversely, increase the need for stable, predictable allocation. Ideally, these techniques should be integrated into the allocator, to ensure a fair competition between these “promoted” values and other values that are candidates for receiving a register.
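As a concrete illustration of scalar replacement, the sketch below shows the transformation at the source level in Python for readability; a compiler would perform the rewrite on its ir after array-subscript analysis proves the element invariant in the inner loop.

```python
# Effect of scalar replacement, sketched at the source level.  The array
# element a[i] is invariant in the inner loop; the transformed version
# keeps it in the scalar t, an unambiguous value that a register
# allocator can safely keep in a register.

def smooth_before(a, b, n):
    for i in range(n):
        for j in range(n):
            b[i][j] = b[i][j] + a[i]   # a[i] re-read on every inner iteration

def smooth_after(a, b, n):
    for i in range(n):
        t = a[i]                       # reused element promoted to a scalar
        for j in range(n):
            b[i][j] = b[i][j] + t
```

Both versions compute the same result; the rewrite changes only which values are candidates for a register, which is exactly why it raises register pressure.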

10.7 Summary and Perspective

Because register allocation is an important component of a modern compiler, it has received much attention in the literature. Strong techniques exist for local allocation, for global allocation, and for regional allocation. Because the underlying problems are almost all np-complete, the solutions tend to be sensitive to small decisions, such as how ties between identically ranked choices are broken.

We have made progress on register allocation by resorting to paradigms that give us leverage. Thus, graph-coloring allocators have been popular, not because register allocation is identical to graph coloring, but rather because coloring captures some of the critical aspects of the global problem. In fact, most of the improvements to the coloring allocators have come from attacking the points where the coloring paradigm does not accurately reflect the underlying problem, such as better cost models and improved methods for live range splitting.

Questions

1. The top-down local allocator is somewhat naive in its handling of values. It allocates one value to a register for the entire basic block.

(a) An improved version might calculate live ranges within the block and allocate values to registers for their live ranges. What modifications would be necessary to accomplish this?

(b) A further improvement might be to split the live range when it cannot be accommodated in a single register. Sketch the data structures and algorithmic modifications that would be needed to (1) break a live range around an instruction (or range of instructions) where a register is not available, and to (2) re-prioritize the remaining pieces of the live range.

(c) With these improvements, the frequency count technique should generate better allocations. How do you expect your results to compare with using Best’s algorithm? Justify your answer.

2. When a graph-coloring global allocator reaches the point where no color is available for a particular live range, lri, it spills or splits that live range. As an alternative, it might attempt to re-color one or more of lri’s neighbors. Consider the case where (lri, lrj) ∈ I and (lri, lrk) ∈ I, but (lrj, lrk) ∉ I. If lrj and lrk have already been colored, and have received different colors, the allocator might be able to re-color one of them to the other’s color, freeing up a color for lri.

(a) Sketch an algorithm for discovering if a legal and productive recoloring exists for lri.

(b) What is the impact of your technique on the asymptotic complexity of the register allocator?

(c) Should you consider recursively re-coloring lrk’s neighbors? Explain your rationale.

3. The description of the bottom-up global allocator suggests inserting spill code for every definition and use in the spilled live range. The top-down global allocator first breaks the live range into block-sized pieces, then combines those pieces when the result is unconstrained, and finally, assigns them a color.

(a) If a given block has one or more free registers, spilling a live range multiple times in that block is wasteful. Suggest an improvement to the spill mechanism in the bottom-up global allocator that avoids this problem.

(b) If a given block has too many overlapping live ranges, then splitting a spilled live range does little to address the problem in that block. Suggest a mechanism (other than local allocation) to improve the behavior of the top-down global allocator inside blocks with high demand for registers.

Chapter Notes

Best’s algorithm, detailed in Section 10.2.2, has been rediscovered repeatedly. Backus reports that Best described the algorithm to him in the mid-1950’s, making it one of the earliest algorithms for the problem [3, 4]. Belady used the same ideas in his offline page replacement algorithm, min, and published it in the mid-1960’s [8]. Harrison describes these ideas in connection with a regional register allocator in the mid-1970’s [36]. Fraser and Hanson used these ideas in the lcc compiler in the mid-1980s [28]. Liberatore et al. rediscovered and reworked this algorithm in the late 1990s [42]. They codified the notion of spilling clean values before spilling dirty values.

Frequency counts have a long history in the literature. . . .

The connection between graph coloring and storage allocation problems that arise in a compiler was originally suggested by the Soviet mathematician Lavrov [40]. He suggested building a graph that represented conflicts in storage assignment, enumerating its various colorings, and using the coloring that required the fewest colors. The Alpha compiler project used coloring to pack data into memory [29, 30].

Top-down graph-coloring begins with Chow. His implementation worked from a memory-to-memory model, so allocation was an optimization that improved the code by eliminating loads and stores. His allocator used an imprecise interference graph, so the compiler used another technique to eliminate extraneous copy instructions. The allocator pioneered live range splitting, using the scheme described on page 273. Larus built a top-down, priority-based allocator for Spur-Lisp. It used a precise interference graph and operated from a register-to-register model.

The first coloring allocator described in the literature was due to Chaitin and his colleagues at Ibm [22, 20, 21]. The description of a bottom-up allocator in Section 10.4.5 follows Chaitin’s plan, as modified by Briggs et al. [17]. Chaitin’s contributions include the fundamental definition of interference, the algorithms for building the interference graph, for coalescing, and for handling spills. Briggs modified Chaitin’s scheme by pushing constrained live ranges onto the stack rather than spilling them directly; this allowed Briggs’ allocator to color a node with many neighbors that used few colors. Subsequent improvements in bottom-up coloring have included better spill metrics [10], methods for rematerializing simple values [16], iterated coalescing [33], methods for spilling partial live ranges [9], and methods for live range splitting [26]. The large size of precise interference graphs led Gupta, Soffa, and Steele to work on splitting the graph with clique separators [34]. Harvey et al. studied some of the tradeoffs that arise in building the interference graph [25].

The problem of modeling overlapping register classes, such as singleton registers and paired registers, requires the addition of some edges to the interference graph. Briggs et al. describe the modifications to the interference graph that are needed to handle several of the commonly occurring cases [15]. Cytron & Ferrante, in their paper “What’s in a name?”, give a polynomial-time algorithm for performing register assignment. It assumes that, at each instruction, | Live | < k. This corresponds to the assumption that the code has already been allocated.

Beatty published one of the early papers on regional register allocation [7]. His allocator performed local allocation and then used a separate technique to remove unneeded spill code at boundaries between local allocation regions—particularly at loop entries and exits. The hierarchical coloring approach was described by Koblenz and Callahan and implemented in the compiler for the Tera computer [19]. The probabilistic approach was presented by Proebsting and Fischer [44].


Chapter 11

Instruction Scheduling

11.1 Introduction

The order in which operations are presented for execution can have a significant effect on the length of time it takes to execute them. Different operations may take different lengths of time. The memory system may take more than one cycle to deliver operands to the register set or to one of the functional units. The functional units themselves may take several execution cycles to deliver the results of a computation. If an operation tries to reference a result before it is ready, the processor typically delays the operation’s execution until the value is ready—that is, it stalls. The alternative, used in some processors, is to assume that the compiler can predict these stalls and reorder operations to avoid them. If no useful operations can be inserted to delay the operation, the compiler must insert one or more nops. In the former case, referencing the result too early causes a performance problem. In the latter case, the hardware assumes that this problem never happens, so when it does, the computation produces incorrect results. In either case, the compiler should carefully consider the ordering of instructions to avoid the problem.

Many processors can initiate execution on more than one operation in each cycle. The order in which the operations are presented for execution can determine the number of operations started, or issued, in a cycle. Consider, for example, a simple processor with one integer functional unit and one floating-point functional unit and a compiled loop that consists of one hundred integer operations and one hundred floating-point operations. If the compiler orders the operations so that the first seventy-five operations are integer operations, the floating-point unit will sit idle until the processor can (finally) see some work for the floating-point unit. If all the operations were independent (an unrealistic assumption), the best order would be to alternate operations between the two units.
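The effect of ordering on this two-unit machine is easy to simulate. The sketch below is a deliberate idealization (all operations independent and single-cycle, which the text notes is unrealistic) under one simple issue discipline, assumed here: look at two adjacent operations, and issue them in the same cycle only when they need different units.

```python
# Idealized dual-issue model: one integer and one floating-point unit,
# all operations independent, one cycle each.  Adjacent operations issue
# together only if they need different functional units.

def cycles(ops):
    """ops: sequence of 'int' / 'fp' unit requirements, in program order."""
    count, i = 0, 0
    while i < len(ops):
        count += 1
        # issue a second operation only if it uses the other unit
        if i + 1 < len(ops) and ops[i + 1] != ops[i]:
            i += 2
        else:
            i += 1
    return count

blocked = ["int"] * 100 + ["fp"] * 100   # all integer operations first
alternating = ["int", "fp"] * 100        # best order for this machine

print(cycles(blocked), cycles(alternating))  # 199 100
```

The blocked order dual-issues only once, at the seam between the two runs; the alternating order dual-issues every cycle and runs in half the time.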

Most processors that issue multiple operations in each cycle have a simple algorithm to decide how many operations to issue. In our simple two functional unit machine, for example, the processor might examine two instructions at a time. It would issue the first operation to the appropriate functional unit, and, if the second operation can execute on the other unit, issue it. If both operations need the same functional unit, the processor must delay the second operation until the next cycle. Under this scheme, the speed of the compiled code depends heavily on instruction order.

    Start   Original code                   Start   Scheduled code
      1     loadAI   r0, 0   ⇒ r1             1     loadAI   r0, 0   ⇒ r1
      4     add      r1, r1  ⇒ r1             2     loadAI   r0, 8   ⇒ r2
      5     loadAI   r0, 8   ⇒ r2             3     loadAI   r0, 16  ⇒ r3
      8     mult     r1, r2  ⇒ r1             4     add      r1, r1  ⇒ r1
      9     loadAI   r0, 16  ⇒ r2             5     mult     r1, r2  ⇒ r1
     12     mult     r1, r2  ⇒ r1             6     loadAI   r0, 24  ⇒ r2
     13     loadAI   r0, 24  ⇒ r2             7     mult     r1, r3  ⇒ r1
     16     mult     r1, r2  ⇒ r1             9     mult     r1, r2  ⇒ r1
     18     storeAI  r1      ⇒ r0, 0         11     storeAI  r1      ⇒ r0, 0

Figure 11.1: Scheduling Example From Introduction

This kind of sensitivity to the order of operations suggests that the compiler should reorder operations in a way that produces faster code. This problem, reordering a sequence of operations to improve its execution time on a specific processor, has been called instruction scheduling. Conceptually, an instruction scheduler looks like

    slow code ──> [ instruction scheduler ] ──> fast code

The primary goals of the instruction scheduler are to preserve the meaning of the code that it receives as input, to minimize execution time by avoiding wasted cycles spent in interlocks and stalls, and to avoid introducing extra register spills due to increased variable lifetimes. Of course, the scheduler should operate efficiently.

This chapter examines the problem of instruction scheduling, and the tools and techniques that compilers use to solve it. The next several subsections provide background information needed to discuss scheduling and understand both the algorithms and their impact.

11.2 The Instruction Scheduling Problem

Recall the example given for instruction scheduling in Section 1.3. Figure 11.1 reproduces it. The column labelled “Start” shows the cycle in which each operation executes. Assume that the processor has a single functional unit; memory operations take three cycles; a mult takes two cycles; all other operations complete in a single cycle; and r0 holds the activation record pointer (arp). Under these parameters, the original code, shown on the left, takes twenty cycles.

The scheduled code, shown on the right, is much faster. It separates long-latency operations from operations that reference their results. This allows operations that do not depend on these results to execute concurrently. The code issues load operations in the first three cycles; the results are available in cycles 4, 5, and 6 respectively. This requires an extra register, r3, to hold the result of the third concurrently executing load operation, but it allows the processor to perform useful work while waiting for the first arithmetic operand to arrive. In effect, this hides the latency of the memory operations. The same idea, applied throughout the block, hides the latency of the mult operation. The reordering reduces the running time to thirteen cycles, a thirty-five percent improvement.

Not all blocks are amenable to improvement in this fashion. Consider, for example, the following block that computes x^8:

    Start
      1     loadAI   r0, @x  ⇒ r1
      4     mult     r1, r1  ⇒ r1
      6     mult     r1, r1  ⇒ r1
      8     mult     r1, r1  ⇒ r1
     10     storeAI  r1      ⇒ r0, @x

The three mult operations have long latencies. Unfortunately, each instruction uses the result of the previous instruction. Thus, the scheduler can do little to improve this code because it has no independent instructions that can be issued while the mults are executing. Because it lacks independent operations that it can execute in parallel, we say that this block has no instruction-level parallelism (ilp). Given enough ilp, the scheduler can hide memory latency and functional-unit latency.
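The limit that such a dependence chain imposes can be computed directly: the length of the longest latency-weighted path through the dependence graph is a lower bound on the block’s running time, no matter how many functional units the machine has. A sketch, using this section’s latency assumptions (the node names are invented for illustration):

```python
# Length of the longest latency chain (the critical path) through an
# acyclic dependence graph: a lower bound on execution time on any
# machine, however wide.

def critical_path(edges, delay):
    """edges: (pred, succ) pairs; delay: node -> cycles; graph is acyclic."""
    succs = {}
    for a, b in edges:
        succs.setdefault(a, []).append(b)
    memo = {}
    def finish(n):  # cycles to finish n and everything that depends on it
        if n not in memo:
            memo[n] = delay[n] + max((finish(s) for s in succs.get(n, [])),
                                     default=0)
        return memo[n]
    return max(finish(n) for n in delay)

# The repeated-squaring block above: a load (3 cycles) feeding three
# dependent mults (2 cycles each), then a store (3 cycles) -- every
# operation sits on one chain, so the chain length is the running time.
delay = {"load": 3, "m1": 2, "m2": 2, "m3": 2, "store": 3}
edges = [("load", "m1"), ("m1", "m2"), ("m2", "m3"), ("m3", "store")]
print(critical_path(edges, delay))  # 12
```

Because every operation lies on the critical path, no reordering can shorten this block; a scheduler can only help when some operations lie off that path.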

Informally, instruction scheduling is the process whereby a compiler reorders the operations in the compiled code in an attempt to decrease its running time. The instruction scheduler takes as input an ordered list of instructions; it produces as output a list of the same instructions.1 The scheduler assumes a fixed set of operations—that is, it does not add operations in the way that the register allocator adds spill code. The scheduler assumes a fixed name space—it does not change the number of enregistered values, although a scheduler might perform some renaming of specific values to eliminate conflicts. The scheduler expresses its results by rewriting the code.

To define scheduling more formally requires introduction of a precedence graph P = (N, E) for the code. Each node n ∈ N is an instruction in the input code fragment. An edge e = (n1, n2) ∈ E if and only if n2 uses the result of n1 as an argument. In addition to its edges, each node has two attributes, a type and a delay. For a node n, the instruction corresponding to n must execute on a functional unit of type type(n) and it requires delay(n) cycles to complete. The example in Figure 11.1 produces the precedence graph shown in Figure 11.2.

1 Throughout the text, we have been careful to distinguish between operations—individual commands to a single functional unit—and instructions—all of the operations that execute in a single cycle. Here, we mean “instructions.”

    Example code                        Its precedence graph

    a: loadAI   r0, 0   ⇒ r1                a
    b: add      r1, r1  ⇒ r1                |
    c: loadAI   r0, 8   ⇒ r2                b   c
    d: mult     r1, r2  ⇒ r1                 \ /
    e: loadAI   r0, 16  ⇒ r2                  d   e
    f: mult     r1, r2  ⇒ r1                   \ /
    g: loadAI   r0, 24  ⇒ r2                    f   g
    h: mult     r1, r2  ⇒ r1                     \ /
    i: storeAI  r1      ⇒ r0, 0                   h
                                                  |
                                                  i

Figure 11.2: Precedence Graph for the Example
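The dependence edges in Figure 11.2 can be recovered mechanically from the code: each operation depends on the most recent earlier operation that defined a register it reads. A sketch of that construction (simplified: it tracks only these true data dependences and ignores the anti- and output dependences introduced by register reuse, which a real scheduler must also respect):

```python
# Build true-dependence edges for a straight-line block: each use of a
# register depends on its most recent definition.  Anti- and output
# dependences are ignored in this sketch.

def precedence_edges(block):
    """block: list of (label, uses, defs); returns a set of (pred, succ)."""
    last_def = {}          # register -> label of its most recent definition
    edges = set()
    for label, uses, defs in block:
        for r in uses:
            if r in last_def:
                edges.add((last_def[r], label))
        for r in defs:
            last_def[r] = label
    return edges

block = [("a", ["r0"], ["r1"]),        # a: loadAI  r0, 0  => r1
         ("b", ["r1"], ["r1"]),        # b: add     r1, r1 => r1
         ("c", ["r0"], ["r2"]),        # c: loadAI  r0, 8  => r2
         ("d", ["r1", "r2"], ["r1"])]  # d: mult    r1, r2 => r1

print(sorted(precedence_edges(block)))
# [('a', 'b'), ('b', 'd'), ('c', 'd')]
```

Run on the first four operations of the example, it reproduces the edges a→b, b→d, and c→d from Figure 11.2; r0 has no definition in the block, so the loads are leaves.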

Nodes with no predecessors in the precedence graph, such as a, c, e, and g in the example, are called leaves of the graph. Since the leaves depend on no other operations, they can be scheduled at any time. Nodes with no successors in the precedence graph, such as i in the example, are called roots of the graph. A precedence graph can have multiple roots. The roots are, in some sense, the most constrained nodes in the graph because they cannot execute until all of their ancestors execute. With this terminology, it appears that we have drawn the precedence graph upside down—at least with relationship to the syntax trees and abstract syntax trees used earlier. Placing the leaves at the top of the figure, however, creates a rough correspondence between placement in the drawing and eventual placement in the scheduled code. A leaf is at the top of the tree because it can execute early in the schedule. A root is at the bottom of the tree because it must execute after each of its ancestors.

Given a precedence graph for its input code fragment, a schedule S maps each node n ∈ N into a non-negative integer that denotes the cycle in which it should be issued, assuming that the first operation issues in cycle one. This provides a clear and concise definition of an instruction: the ith instruction is the set of operations {n | S(n) = i}. A schedule must meet three constraints.

1. S(n) ≥ 0, for each n ∈ N. This constraint forbids operations from being issued before execution starts. Any schedule that violates this constraint is not well formed. For the sake of uniformity, the schedule must also have at least one operation n′ with S(n′) = 1.

2. If (n1, n2) ∈ E, then S(n1) + delay(n1) ≤ S(n2). This constraint enforces correctness. It requires that an operation cannot be issued until the operations that produce its arguments have completed. A schedule that violates this rule changes the flow of data in the code and is likely to produce incorrect results.
