Cooper K. Engineering a compiler
10.3. MOVING BEYOND SINGLE BLOCKS
 1. loadI   @stack     ⇒ rarp
 2. loadAI  rarp, 0    ⇒ rw
 3. loadI   2          ⇒ r2
 4. loadAI  rarp, 8    ⇒ rx
 5. loadAI  rarp, 16   ⇒ ry
 6. loadAI  rarp, 24   ⇒ rz
 7. mult    rw, r2     ⇒ rw
 8. mult    rw, rx     ⇒ rw
 9. mult    rw, ry     ⇒ rw
10. mult    rw, rz     ⇒ rw
11. storeAI rw         ⇒ rarp, 0

Live Range   Register Name   Interval
 1           rarp            [1,11]
 2           rw              [2,7]
 3           rw              [7,8]
 4           rw              [8,9]
 5           rw              [9,10]
 6           rw              [10,11]
 7           r2              [3,7]
 8           rx              [4,8]
 9           ry              [5,9]
10           rz              [6,10]
Figure 10.4: Live ranges in a basic block
live range as the definition.
The term “live range” relies, implicitly, on the notion of liveness—one of the fundamental ideas in compile-time analysis of programs.
At some point p in a procedure, a value v is live if it has been defined along a path from the procedure’s entry to p and a path along which v is not redefined exists from p to a use of v.
Thus, if v is live at p, then v must be preserved because subsequent execution may use v. The definition is carefully worded. A path exists from p to a use of v. This does not guarantee that any particular execution will follow the path, or that any execution will ever follow the path. The existence of such a path, however, forces the compiler to preserve v for the potential use.
The set of live ranges is distinct from the set of variables or values. Every value computed in the code is part of some live range, even if it has no name in the original source code. Thus, the intermediate results produced by address computations have live ranges, just the same as programmer-named variables, array elements, and addresses loaded for use as a branch target. A specific programmer-named variable may have many distinct live ranges. A register allocator that uses live ranges can place those distinct live ranges in different registers. Thus, a variable, x, might reside in different registers at two distinct points in the executing program.
B1: storeAI r7 ⇒ r0, @x
B2: storeAI r3 ⇒ r0, @x
B3: loadAI r0, @x ⇒ r2
B4: loadAI r0, @x ⇒ r4
Control-flow edges: B1 → B3, B1 → B4, B2 → B4
Figure 10.5: Problems with multiple blocks
To make these ideas concrete, consider the problem of finding live ranges in a single basic block. Figure 10.4 shows the block from Figure 1.1, with an initial operation added to initialize rarp. All other references to rarp inside the block are uses rather than definitions. Thus, a single value for rarp is used throughout the block. The interval [1, 11] represents this live range. Consider rw. Operation 2 defines rw; operation 7 uses that value. Operations 7, 8, 9, and 10 each define a new value stored in rw; in each case, the following operation uses the value. Thus, the register named rw in the figure holds a number of distinct live ranges, specifically [2, 7], [7, 8], [8, 9], [9, 10], and [10, 11]. A register allocator need not keep these distinct live ranges of rw in the same register. Instead, each live range in the block can be treated as an independent value for allocation and assignment. The table in Figure 10.4 shows all of the live ranges in the block.
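The intervals listed in Figure 10.4 can be reproduced mechanically. The following Python sketch is illustrative only: the `Op` tuple, the `BLOCK` encoding, and the function name `block_live_ranges` are assumptions made for this example, not part of the text. It walks the block forward, extends a register's current live range at each use, and starts a new live range at each definition. It assumes every value dies inside the block; handling values that are live on exit requires the LiveOut information discussed next.

```python
from collections import namedtuple

# Hypothetical encoding of an operation: opcode, registers read, registers written.
Op = namedtuple("Op", "opcode uses defs")

# The block from Figure 10.4; operation number i is BLOCK[i - 1].
BLOCK = [
    Op("loadI",   [],             ["rarp"]),
    Op("loadAI",  ["rarp"],       ["rw"]),
    Op("loadI",   [],             ["r2"]),
    Op("loadAI",  ["rarp"],       ["rx"]),
    Op("loadAI",  ["rarp"],       ["ry"]),
    Op("loadAI",  ["rarp"],       ["rz"]),
    Op("mult",    ["rw", "r2"],   ["rw"]),
    Op("mult",    ["rw", "rx"],   ["rw"]),
    Op("mult",    ["rw", "ry"],   ["rw"]),
    Op("mult",    ["rw", "rz"],   ["rw"]),
    Op("storeAI", ["rw", "rarp"], []),
]

def block_live_ranges(block):
    """Return (register, first, last) triples, one per live range in the block."""
    open_ranges = {}   # register name -> [defining op, last use seen so far]
    finished = []
    for i, op in enumerate(block, start=1):
        # Uses are read before the operation writes its result.
        for reg in op.uses:
            if reg in open_ranges:
                open_ranges[reg][1] = i
        # A new definition ends the register's previous live range.
        for reg in op.defs:
            if reg in open_ranges:
                start, end = open_ranges.pop(reg)
                finished.append((reg, start, end))
            open_ranges[reg] = [i, i]
    finished.extend((reg, s, e) for reg, (s, e) in open_ranges.items())
    return sorted(finished, key=lambda t: (t[1], t[2]))

for reg, first, last in block_live_ranges(BLOCK):
    print(f"{reg:5s} [{first},{last}]")
```

Processing uses before definitions matters for an operation such as mult rw, r2 ⇒ rw, where one live range of rw ends and another begins at the same operation.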
To find live ranges in regions larger than a single block, the compiler must discover the set of values that are live on entry to each block, as well as those that are live on exit from each block. To summarize this information, the compiler can annotate each basic block b with sets LiveIn(b) and LiveOut(b).
LiveIn: A value x ∈ LiveIn(b) if and only if it is defined along some path through the control-flow graph that leads to b and it is either used directly in b, or is in LiveOut(b). That is, x ∈ LiveIn(b) implies that x is live just before the first operation in b.
LiveOut: A value x ∈ LiveOut(b) if and only if it is used along some path leaving b, and it is either defined in b or is in LiveIn(b). That is, x is live immediately after the last operation in b.
Chapter 13 shows how to compute LiveIn and LiveOut sets for each block. At any point p in the code, values that are not live need no register. Similarly, the only values that need registers at point p are those values that are live at p, or some subset of those values. Local register allocators, when implemented in real compilers, use Live sets to determine when a value must be preserved in memory beyond its last use in the block. Global allocators use analogous information to discover live ranges and to guide the allocation process.
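As a rough illustration of how such sets might be obtained, the sketch below uses the common backward, iterative data-flow formulation: LiveOut(b) is the union of LiveIn over b's successors, and LiveIn(b) is the set of names used in b before any redefinition, plus whatever is in LiveOut(b) and not defined in b. This is an assumption about the method, not a summary of Chapter 13, and it presumes that every use is preceded by a definition on every path. The function name, the `(uses, defs)` encoding, and the four-block example are hypothetical.

```python
def liveness(blocks, succ):
    """blocks: name -> list of (uses, defs) pairs, in execution order.
    succ:   name -> list of successor block names.
    Returns (live_in, live_out), each mapping a block name to a set of names."""
    exposed, killed = {}, {}
    for b, ops in blocks.items():
        exposed[b], killed[b] = set(), set()
        for uses, defs in ops:
            exposed[b] |= set(uses) - killed[b]   # used before any definition in b
            killed[b] |= set(defs)                # defined somewhere in b

    live_in = {b: set() for b in blocks}
    live_out = {b: set() for b in blocks}
    changed = True
    while changed:                                # iterate to a fixed point
        changed = False
        for b in blocks:
            out = set().union(*(live_in[s] for s in succ.get(b, [])))
            inn = exposed[b] | (out - killed[b])
            if out != live_out[b] or inn != live_in[b]:
                live_out[b], live_in[b] = out, inn
                changed = True
    return live_in, live_out

# Hypothetical diamond-shaped fragment: b0 branches to b1 and b2, which join at b3.
blocks = {
    "b0": [((), ("h",))],
    "b1": [((), ("i",)), (("i",), ()), ((), ("k",))],
    "b2": [((), ("j",)), (("j",), ("k",))],
    "b3": [(("h",), ()), (("k",), ())],
}
succ = {"b0": ["b1", "b2"], "b1": ["b3"], "b2": ["b3"], "b3": []}
live_in, live_out = liveness(blocks, succ)
```

In this example h is live across both b1 and b2, while k becomes live only at the exits of those blocks.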
10.3.2 Complications at Block Boundaries
A compiler that uses local register allocation might compute LiveIn and LiveOut sets for each block as a necessary prelude to provide the local allocator with information about the status of values at the block’s entry and its exit. The presence of these sets can simplify the task of making the allocations for individual blocks behave appropriately when control flows from one block to another. For example, a value in LiveOut(b) must be stored back to memory after a definition in b; this ensures that the value will be in the expected location when it is loaded in a subsequent block. In contrast, if the value is not in LiveOut(b), it need not be stored, except as a spill for later use inside b.
Some of the effects introduced by considering multiple blocks complicate either assignment or allocation. Figure 10.5 suggests some of the complications that arise in global assignment. Consider the transition that occurs along the edge from block B1 to block B3.
B1 has the value of program variable x in r7. B3 wants it in r2. When it processes B1, the allocator has no knowledge of the context created by the other blocks, so it must store x back to x’s location in memory (at offset @x from the arp in r0). Similarly, when the allocator processes B3, it has no knowledge about the behavior of B1, so it must load x from memory. Of course, if it knew the results of allocation on B1, it could assign x to r7 and make the load unnecessary. In the absence of this knowledge, it must generate the load. The references to x in B2 and B4 further complicate the problem. Any attempt to coordinate x’s assignment across blocks must consider both those blocks, since B4 is a successor of B1, and any change in B4’s treatment of x has an impact on its other predecessor, B2.
Similar effects arise with allocation. What if x were not referenced in B2? Even if we could coordinate assignment globally, to ensure that x was always in r7 when it was used, the allocator would need to insert a load of x at the end of B2 to let B4 avoid the initial load of x. Of course, if B2 had other successors, they might not reference x and might need another value in r7.
These fringe effects at block boundaries can become complex. They do not fit into the local allocators because they deal with phenomena that are entirely outside the local allocators’ scope. If the allocator manages to insert a few extra instructions that iron out the differences, it may choose to insert them in the wrong block, for example, in a block that forms the body of an inner loop rather than in that loop’s header block. The local models assume that all instructions execute with the same frequency; stretching the models to handle larger regions invalidates that assumption, too. The difference between a good allocation and a poor one may be a few instructions in the most heavily executed block in the code.
A second issue, more subtle but more problematic, arises when we try to stretch the local allocation paradigms beyond single blocks. Consider using Best’s algorithm on block B1. With only one block, the notion of the “furthest” next reference is clear. The local algorithm has a unique distance to each next reference. With multiple successor blocks, the allocator must choose between references along different paths. For the last reference to some value y in B1, the next reference is either the first reference to y in B3 or the first reference to y in B4. These two references are unlikely to be in the same position, relative to the end of B1. Alternatively, B3 might not contain a reference to y, while B4 does. Even if both blocks use y, and the references are equidistant in the input code, local spilling in one block might make them different in unpredictable ways. The basic premise of the bottom-up local method begins to crumble in the presence of multiple control-flow paths.
All of these problems suggest that a different approach is needed to move beyond local allocation to regional or global allocation. Indeed, the successful global allocation algorithms bear little resemblance to the local algorithms.
CHAPTER 10. REGISTER ALLOCATION
10.4 Global Register Allocation and Assignment
The register allocator’s goal is to minimize the execution time required for instructions that it must insert. This is a global issue, not a local one. From the perspective of execution time, the difference between two different allocations for the same basic code lies in the number of loads, stores, and copy operations inserted by the allocator and their placement in the code. Since different blocks execute different numbers of times, the placement of spills has a strong impact on the amount of execution time spent in spill code. Since block execution frequencies can vary from run to run, the notion of a best allocation is somewhat tenuous; it must be conditioned on a particular set of block execution frequencies.
Global register allocation differs from local allocation in two fundamental ways.
1. The structure of a live range can be more complex than in the local allocator. In a single block, a live range is just an interval in a linear string of operations. Globally, a live range is the set of definitions that can reach a given use, along with all the uses that those definitions can reach. Finding live ranges is more complex in a global allocator.
2. Distinct references to the same variable can execute a different number of times. In a single block, if any operation executes, all the operations execute (unless an exception occurs), so the cost of spilling is uniform. In a larger scope, each reference can be in a different block, so the cost of spilling depends on where the references are found. When it must spill, the global allocator should consider the spill cost of each live range that is a candidate to spill.
Any global allocator must address both these issues. This makes global allocation substantially more complex than local allocation.
To address the issue of complex live ranges, global allocators explicitly create a name space where each distinct live range has a unique name. Thus, the allocator maps a live range onto either a physical register or a memory location. To accomplish this, the global allocator first constructs live ranges and renames all the virtual register references in the code to reflect the new name space constructed around the live ranges. To address the issue of execution frequencies, the allocator can annotate each reference or each basic block with an estimated execution frequency. The estimates can come from static analysis or from profile data gathered during actual executions of the program. These estimated execution frequencies will be used later in the allocator to guide decisions about allocation and spilling.
Finally, global allocators must make decisions about allocation and assignment. They must decide when two values can share a single register, and they must modify the code to map each such value to a specific register. To accomplish these tasks, the allocator needs a model that tells it when two values can (and cannot) share a single register. It also needs an algorithm that can use the model to derive effective and efficient allocations. Many global allocators operate on a graph-coloring paradigm. They build a graph to model the conflicts between registers and attempt to find an appropriate coloring for the graph. The allocators that we discuss in this section all operate within this paradigm.
Digression: Graph Coloring
Many global register allocators use graph coloring as a paradigm to model the underlying allocation problem. For an arbitrary graph G, a coloring of G assigns a color to each node in G so that no pair of adjacent nodes have the same color. A coloring that uses k colors is termed a k-coloring, and the smallest k for which a k-coloring exists is the graph’s chromatic number. Consider the following graphs:
[Two graphs on nodes 1 through 5. In the left graph, node 1 and node 5 are each adjacent to nodes 2, 3, and 4, and there are no other edges. The right graph is the left graph plus one additional edge incident to node 3.]
The graph on the left is 2-colorable. For example, assigning blue to nodes 1 and 5, and red to nodes 2, 3, and 4 produces the desired result. Adding one edge, as shown on the right, means that the graph needs three colors. (Assign blue to nodes 1 and 5, red to nodes 2 and 4, and white to node 3.) No 2-coloring exists for the right-hand graph.
For a given graph, the problem of finding its chromatic number is NP-complete. Similarly, the problem of determining if a graph is k-colorable, for some fixed k (k ≥ 3), is NP-complete. Algorithms that use graph coloring as a paradigm for allocating finite resources use approximate methods that try to discover colorings into the set of available resources.
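For graphs as small as the two in the digression, colorability can be checked by plain enumeration. The Python sketch below does exactly that. The edge lists are a reconstruction from the colorings described in the digression, and the choice of (2, 3) as the added edge is an assumption (an edge between 3 and 4 would behave the same way). The function name `find_coloring` is hypothetical. Exhaustive search takes time exponential in the number of nodes, which is why practical allocators rely on approximate methods instead.

```python
from itertools import product

def find_coloring(nodes, edges, k):
    """Return a dict node -> color index that uses at most k colors, or None."""
    for colors in product(range(k), repeat=len(nodes)):
        assignment = dict(zip(nodes, colors))
        # A coloring is legal when no edge joins two nodes of the same color.
        if all(assignment[a] != assignment[b] for a, b in edges):
            return assignment
    return None

NODES = [1, 2, 3, 4, 5]
# Left graph: nodes 1 and 5 are each adjacent to nodes 2, 3, and 4.
LEFT = [(1, 2), (1, 3), (1, 4), (2, 5), (3, 5), (4, 5)]
# Right graph: one extra edge; (2, 3) is assumed here for illustration.
RIGHT = LEFT + [(2, 3)]

print(find_coloring(NODES, LEFT, 2) is not None)    # a 2-coloring exists
print(find_coloring(NODES, RIGHT, 2) is not None)   # no 2-coloring exists
print(find_coloring(NODES, RIGHT, 3) is not None)   # a 3-coloring exists
```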
10.4.1 Discovering Global Live Ranges
To construct live ranges, the compiler must discover the relationships that exist between different definitions and uses. The allocator must derive a name space that groups together all the definitions that reach a single use and all the uses that a single definition can reach. This suggests an approach where the compiler assigns each definition a unique name and merges definition names together when they reach a common use.
The static single assignment form (SSA) of the code provides a natural starting point for this construction. Recall, from Section 6.3.6, that SSA assigns a unique name to each definition and inserts φ-functions to ensure that each use refers to only one definition. The φ-functions concisely record the fact that distinct definitions on different paths in the control-flow graph reach a single reference. Two definitions that flow into a φ-function belong in the same live range because the φ-function creates a name representing both values. Any operation that references the name created by the φ-function uses one of these values; the specific value depends on how control flow reached the φ-function. Because the two definitions can be referenced in the same use, they belong in the same register. Thus, φ-functions are the key to building live ranges.
To build live ranges from SSA, the allocator uses the disjoint-set union-find algorithm [27] and makes a single pass over the code. First, the allocator assigns a distinct set to each SSA name, or definition. Next, it examines each φ-function in the program, and unions together the sets of each φ-function parameter and the set of the φ-function’s result. After all the φ-functions have been processed, the resulting sets represent the maximal live ranges of the code. At this point, the allocator can rewrite the code to use the live range names. (Alternatively, it can maintain a mapping between SSA names and live-range names, and add a level of indirection in its subsequent manipulations.)
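The construction just described can be sketched in a few lines of Python. The class `DisjointSets`, the function `build_live_ranges`, and the tiny example program are hypothetical, and the union-find shown uses path halving without union by rank, which is a simplification. The sketch unions each φ-function's result name with its parameters, so that later uses of the φ result fall into the same live range as the definitions that feed it.

```python
class DisjointSets:
    """A small union-find structure over arbitrary hashable names."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            # Path halving: point x at its grandparent as we walk up.
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a

def build_live_ranges(ssa_names, phis):
    """ssa_names: iterable of SSA names.
    phis: list of (result_name, [parameter_names]) pairs, one per phi-function.
    Returns a dict mapping each SSA name to a representative live-range name."""
    sets = DisjointSets()
    for name in ssa_names:
        sets.find(name)                 # each SSA name starts in its own set
    for result, params in phis:
        for p in params:
            sets.union(result, p)       # phi result and parameters share a live range
    return {name: sets.find(name) for name in ssa_names}

# Hypothetical program: x2 = phi(x0, x1); y0 is unrelated to any phi-function.
names = ["x0", "x1", "x2", "y0"]
live_range = build_live_ranges(names, [("x2", ["x0", "x1"])])
```

The returned dictionary plays the role of the mapping between SSA names and live-range names mentioned above; rewriting the code amounts to replacing each name with its representative.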
10.4.2 Estimating Global Spill Costs
To let it make informed spill decisions, the global allocator must estimate the costs of spilling each value. The value might be a single reference, or it might be an entire live range. The cost of a spill has three components: the address computation, the memory operation, and an estimated execution frequency.
The compiler can choose the spill location in a way that minimizes the cost of addressing; usually, this means keeping spilled values in the procedure’s activation record. In this scenario, the compiler can generate an operation such as loadAI or storeAI for the spill. As long as the arp is in a register, the spill should not require additional registers for the address computation.
The cost of the memory operation is, in general, unavoidable. If the target machine has local (i.e., on-chip) memory that is not cached, the compiler might use that memory for spilling. More typically, the compiler needs to save the value in the computer’s main memory and to restore it from that memory when a later operation needs the value. This entails the full cost of a load or store operation. As memory latencies rise, the cost of these operations grows. To make matters somewhat worse, the allocator only inserts spill operations when it absolutely needs a register. Thus, many spill operations occur in code where demand for registers is high. This may keep the scheduler from moving those operations far enough to hide the memory latency. The compiler must hope that spill locations stay in cache. (Paradoxically, those locations only stay in the cache if they are accessed often enough to avoid replacement—suggesting that the code is executing too many spills.)
Negative Spill Costs A live range that contains a load, a store, and no other uses should receive a negative spill cost if the load and store refer to the same address. (Such a live range can result from transformations intended to improve the code; for example, if the use were optimized away and the store resulted from a procedure call rather than the definition of a new value.) Any live range with a negative spill cost should be spilled, since doing so decreases demand for registers and removes instructions from the code.
Infinite Spill Costs Some live ranges are short enough that spilling them never helps the allocation. Consider a use that immediately follows its definition. Spilling the definition and use produces two short live ranges. The first contains the definition followed by a store; the second live range contains a load followed by the use. Neither of these new live ranges uses fewer registers than the original live range, so the spill produced no benefit. The allocator should assign the original live range a spill cost of infinity.
In general, a live range should have infinite spill cost if no interfering live range ends between its definitions and its uses, and no more than k − 1 values are defined between the definitions and the uses. The first condition stipulates that availability of registers does not change between the definitions and uses. The second avoids a pathological situation that can arise from a series of spilled copies: m loads followed by m stores, where m > k. This can create a set of more than k mutually interfering live ranges; if the allocator assigns them all infinite spill costs, it will be unable to resolve the situation.
Accounting for Execution Frequencies To account for the different execution frequencies of the basic blocks in the control-flow graph, the compiler must annotate each block (if not each reference) with an estimated execution count. Most compilers use simple heuristics to estimate execution counts. A common method is to assume that each loop executes ten times. Thus, it assigns a count of ten to a load inside one loop, and a count of one hundred to a load inside two loops. An unpredictable if-then-else might decrease the execution count by half. In practice, these estimates ensure a bias toward spilling in outer loops rather than inner loops.
To estimate the spill cost for a single reference, the allocator forms the product
(addressing cost + cost of memory operation) × execution frequency.
For each live range, it sums the costs of the individual references. This requires a pass over all the blocks in the code. The allocator can pre-compute these costs for all live ranges, or it can wait until it discovers that it must spill a value.
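A minimal sketch of this estimate follows, using the ten-iterations-per-loop and half-per-branch heuristics described above. The function names, the `(loop_depth, branch_depth)` encoding of a reference, and the unit costs (zero for addressing, one for the memory operation) are assumptions chosen for illustration, not values taken from the text.

```python
def estimated_frequency(loop_depth, branch_depth=0):
    """Assume each enclosing loop runs ten times and each enclosing
    unpredictable if-then-else halves the execution count."""
    return (10 ** loop_depth) * (0.5 ** branch_depth)

def spill_cost(references, addressing_cost=0, memory_cost=1):
    """references: iterable of (loop_depth, branch_depth) pairs, one per
    definition or use in the live range. Both cost parameters are
    hypothetical unit costs."""
    return sum(
        (addressing_cost + memory_cost) * estimated_frequency(loops, branches)
        for loops, branches in references
    )

# A live range defined outside any loop and used twice inside a doubly nested loop.
cost = spill_cost([(0, 0), (2, 0), (2, 0)])
```

With these assumptions the two inner-loop references dominate the total, which reflects the bias toward spilling values whose references sit in outer loops.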
10.4.3 Interferences and the Interference Graph
The fundamental effect that a global register allocator must model is the competition between values for space in the target machine’s register set. Consider two live ranges, lri and lrj. If there exists an operation where both lri and lrj are live, then they cannot reside in the same register. (In general, a physical register can hold only one value.) We say that lri and lrj interfere.
To model the allocation problem, the compiler can build an interference graph, I. Nodes in I represent individual live ranges. Edges in I represent interferences. Thus, an edge (ni, nj) ∈ I exists if and only if the corresponding live ranges, lri and lrj, are both live at some operation. Figure 10.6 shows a code fragment that defines four live ranges, lrh, lri, lrj, and lrk, along with the corresponding interference graph. lrh interferes with each of the other live ranges. The rest of the live ranges, however, do not interfere with each other.
Code fragment:
b0: lrh ← ...
b1: lri ← ...
    ... ← lri
    lrk ← ...
b2: lrj ← ...
    lrk ← lrj
b3: ... ← lrh
    ... ← lrk
Control-flow edges: b0 → b1, b0 → b2, b1 → b3, b2 → b3
Interference graph: edges (lrh, lri), (lrh, lrj), (lrh, lrk)
Figure 10.6: Live ranges and interference
If the compiler can construct a k-coloring for I, where k ≤ the size of the target machine’s register set, then it can map the colors directly onto physical registers to produce a legal allocation. In the example, lrh receives its own color because it interferes with the other live ranges. The other live ranges can all share a single color. Thus, the graph is 2-colorable and the code fragment, as shown, can be rewritten to use just two registers.
Consider what would happen if another phase of the compiler reordered the two operations at the end of b1. This makes lrk and lri simultaneously live. Since they now interfere, the allocator must add the edge (lrk, lri) to I. The resulting graph is not 2-colorable, because lrh, lri, and lrk now interfere pairwise. The graph is small enough to prove this by enumeration. To handle this graph, the allocator has two options: use three colors (registers), or, if the target machine has only two registers, spill one of lri or lrh before the definition of lrk in b1. Of course, the allocator could also reorder the two operations and eliminate the interference between lri and lrk. Typically, register allocators do not reorder operations to eliminate interferences. Instead, allocators assume a fixed order of operations and leave ordering questions to the instruction scheduler (see Chapter 11).
Building the Interference Graph Once the allocator has discovered global live ranges and annotated each basic block in the code with its LiveOut set, it can construct the interference graph by making a simple linear pass over each block. Figure 10.7 shows the basic algorithm. The compiler uses the block’s LiveOut set as an initial value for LiveNow and works its way backward through the block, updating LiveNow to reflect the operations already processed. At each operation, it adds an edge from the live range being defined to each live range in LiveNow. It then incrementally updates LiveNow and moves up the block by an instruction.
for each lri
    create a node ni ∈ N
for each basic block b
    LiveNow(b) ← LiveOut(b)
    for opn, opn−1, opn−2, ..., op1 in b, with form opi lrj, lrk ⇒ lrl
        for each lri in LiveNow(b)
            add (lrl, lri) to E
        remove lrl from LiveNow(b)
        add lrj and lrk to LiveNow(b)
Figure 10.7: Constructing the Interference Graph
This method of computing interferences takes time proportional to the size of the LiveNow sets at each operation. The naive algorithm would add edges between each pair of values in LiveNow at each operation; that would require time proportional to the square of the set sizes at each operation. The naive algorithm also introduces interferences between inputs to an operation and its output. This creates an implicit assumption that each value is live beyond its last use, and prevents the allocator from using the same register for an input and an output in the same operation.
Notice that a copy operation, such as i2i lri ⇒ lrj, does not create an interference between lri and lrj. In fact, lri and lrj may occupy the same physical register, unless subsequent context creates an interference. Thus, a copy that occurs as the last use of some live range can often be eliminated by combining, or coalescing, the two live ranges (see Section 10.4.6).
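The backward pass of Figure 10.7, together with the special treatment of copies, might be sketched in Python as follows. The `Op` tuple, the function `build_interference`, and the encoding of the Figure 10.6 fragment are assumptions made for this example. The LiveOut sets are written out by hand, and the operation lrk ← lrj is assumed to be an i2i copy. Edges are stored as unordered pairs, and self-edges are skipped.

```python
from collections import namedtuple

# Hypothetical operation record: opcode, live ranges read, live ranges written.
Op = namedtuple("Op", "opcode uses defs")

def build_interference(blocks, live_out):
    """blocks: name -> list of Op in execution order.
    live_out: name -> set of live ranges live on exit from the block.
    Returns a set of frozenset({a, b}) edges."""
    edges = set()
    for name, ops in blocks.items():
        live_now = set(live_out[name])
        for op in reversed(ops):                 # walk the block bottom-up
            for target in op.defs:
                for lr in live_now:
                    if lr == target:
                        continue                 # no self-edges
                    if op.opcode == "i2i" and lr in op.uses:
                        continue                 # copy source and target do not interfere
                    edges.add(frozenset((target, lr)))
            live_now -= set(op.defs)
            live_now |= set(op.uses)
    return edges

# The fragment of Figure 10.6, with "op" standing for any non-copy operation.
FRAGMENT = {
    "b0": [Op("op", [], ["lrh"])],
    "b1": [Op("op", [], ["lri"]), Op("op", ["lri"], []), Op("op", [], ["lrk"])],
    "b2": [Op("op", [], ["lrj"]), Op("i2i", ["lrj"], ["lrk"])],
    "b3": [Op("op", ["lrh"], []), Op("op", ["lrk"], [])],
}
LIVE_OUT = {"b0": {"lrh"}, "b1": {"lrh", "lrk"}, "b2": {"lrh", "lrk"}, "b3": set()}

graph = build_interference(FRAGMENT, LIVE_OUT)
```

Because an operation's inputs join LiveNow only after the edges for its output have been added, an input whose last use is this operation never interferes with the output.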
To improve efficiency later in the allocator, several authors recommend building two representations for I, a lower-triangular bit matrix and a set of adjacency lists. The bit matrix allows a constant-time test for interference, while the adjacency lists make iterating over a node’s neighbors efficient. The bit matrix might be replaced with a hash table; studies have shown that this can produce space savings for sufficiently large interference graphs. The compiler writer may also treat disjoint register classes as separate allocation problems to reduce both the size of I and the overall allocation time.
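One possible shape for the dual representation is sketched below. The class name `InterferenceGraph` and its methods are hypothetical, and live ranges are assumed to be numbered 0 through n − 1. Only the entries below the diagonal are stored, so an unordered pair (i, j) with i > j maps to bit i(i − 1)/2 + j.

```python
class InterferenceGraph:
    """Interference graph kept as a triangular bit matrix plus adjacency lists."""

    def __init__(self, n):
        self.n = n
        nbits = n * (n - 1) // 2                 # entries strictly below the diagonal
        self.bits = bytearray((nbits + 7) // 8)
        self.adj = [[] for _ in range(n)]

    def _index(self, i, j):
        if i < j:
            i, j = j, i                          # normalize so that i > j
        return i * (i - 1) // 2 + j

    def interferes(self, i, j):
        """Constant-time membership test using the bit matrix."""
        if i == j:
            return False
        k = self._index(i, j)
        return bool(self.bits[k >> 3] & (1 << (k & 7)))

    def add_edge(self, i, j):
        """Record an interference once, in both representations."""
        if i == j or self.interferes(i, j):
            return
        k = self._index(i, j)
        self.bits[k >> 3] |= 1 << (k & 7)
        self.adj[i].append(j)
        self.adj[j].append(i)

g = InterferenceGraph(4)
g.add_edge(0, 1)
g.add_edge(3, 0)
g.add_edge(1, 0)          # duplicate; the bit test keeps the lists free of repeats
```

The bit test in `add_edge` is what keeps duplicate entries out of the adjacency lists without scanning them.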
Building an Allocator To build a global allocator based on the graph-coloring paradigm, the compiler writer needs two additional mechanisms. First, the allocator needs an efficient technique for discovering k-colorings. Unfortunately, the problem of determining if a k-coloring exists for a particular graph is NP-complete. Thus, register allocators use fast approximations that are not guaranteed to find a k-coloring. Second, the allocator needs a strategy for handling the case when no color remains for a specific live range. Most coloring allocators approach this by rewriting the code to change the allocation problem. The allocator picks one or more live ranges to modify. It either spills or splits the chosen live range. Spilling turns the chosen live range into a set of tiny live ranges, one at each definition and use of the original live range. Splitting breaks the live range into smaller, but non-trivial, pieces. In either case, the transformed code performs the same computation, but has a different interference graph. If the changes are effective, the new interference graph is easier to color.
10.4.4 Top-down Coloring
A top-down, graph-coloring, global register allocator uses low-level information to assign colors to individual live ranges, and high-level information to select the order in which it colors live ranges. To find a color for a specific live range, lri, the allocator tallies the colors already assigned to lri’s neighbors in I. If the set of neighbors’ colors is incomplete—that is, one or more colors are not used—the allocator can assign an unused color to lri. If the set of neighbors’ colors is complete, then no color is available for lri and the allocator must use its strategy for uncolored live ranges.
To order the live ranges, the top-down allocator uses an external ranking. The priority-based, graph-coloring allocators rank live ranges by the estimated run-time savings that accrue from keeping the live range in a register. These estimates are analogous to the spill costs described in Section 10.4.2. The top-down global allocator uses registers for the most important values, as identified by these rankings.
The allocator considers the live ranges, in rank order, and attempts to assign a color to each of them. If no color is available for a live range, the allocator invokes the spilling or splitting mechanism to handle the uncolored live range. To improve the process, the allocator can partition the live ranges into two sets: constrained live ranges and unconstrained live ranges. A live range is constrained if it has k or more neighbors, that is, it has degree ≥ k in I. (We denote “degree of lri” as lri°, so lri is constrained if and only if lri° ≥ k.) Constrained live ranges are colored first, in rank order. After all constrained live ranges have been handled, the unconstrained live ranges are colored, in any order. An unconstrained live range must receive a color. When lri° < k, no assignment of colors to lri’s neighbors can prevent lri from receiving a color.
By handling constrained live ranges first, the allocator avoids some potential spills. The alternative, working in a straight priority order, would let the allocator assign all available colors to unconstrained, but higher priority, neighbors of lri. This could force lri to remain uncolored, even though colorings of its unconstrained neighbors that leave a color for lri must exist.
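The ordering policy described above can be sketched as follows. The function `top_down_color`, the adjacency-dictionary encoding, and the priority values are hypothetical; the example graph is the one from Figure 10.6 after the edge between lrk and lri has been added. The sketch only reports which live ranges stay uncolored and does not perform the spilling or splitting step.

```python
def top_down_color(adj, priority, k):
    """adj: live range -> set of interfering live ranges.
    priority: live range -> estimated savings (larger means more important).
    Returns (colors, uncolored), with colors numbered 0 .. k-1."""
    constrained = [n for n in adj if len(adj[n]) >= k]
    unconstrained = [n for n in adj if len(adj[n]) < k]
    # Constrained live ranges go first, in rank order; the rest can go in any order.
    constrained.sort(key=lambda n: priority[n], reverse=True)

    colors, uncolored = {}, []
    for n in constrained + unconstrained:
        used = {colors[m] for m in adj[n] if m in colors}
        free = [c for c in range(k) if c not in used]
        if free:
            colors[n] = free[0]
        else:
            uncolored.append(n)      # candidate for spilling or splitting
    return colors, uncolored

ADJ = {
    "lrh": {"lri", "lrj", "lrk"},
    "lri": {"lrh", "lrk"},
    "lrj": {"lrh"},
    "lrk": {"lrh", "lri"},
}
PRIORITY = {"lrh": 30, "lri": 20, "lrj": 5, "lrk": 10}   # hypothetical rankings

two_colors, left_over = top_down_color(ADJ, PRIORITY, k=2)
three_colors, none_left = top_down_color(ADJ, PRIORITY, k=3)
```

With two colors, the lowest-ranked constrained live range is the one left without a color; with three colors, every live range is colored.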
Handling Spills When the top-down allocator encounters a live range that cannot be colored, it must either spill or split some set of live ranges to change the problem. Since all previously colored live ranges were ranked higher than the uncolored live range, it makes sense to spill the uncolored live range rather than a previously colored live range. The allocator can consider re-coloring one
