Cooper K.Engineering a compiler
.pdf10.4. GLOBAL REGISTER ALLOCATION AND ASSIGNMENT |
273 |
of the previously colored live ranges, but it must exercise care to avoid the full generality and cost of backtracking.
To spill lri, the allocator inserts a store after every definition of lri and a load before each use of lri. If the memory operations need registers, the allocator can reserve enough registers to handle them. The number of registers needed for this purpose is a function of the target machine’s instruction set architecture. Reserving these registers simplifies spilling.
An alternative to reserving registers for spill code is to look for free colors at each definition and each use; this strategy can lead to a situation where the allocator must retroactively spill a previously colored live range. (The allocator would recompute interferences at each spill site and compute the set of neighbor’s colors for the spill site. If this process does not discover an open color at each spill site (or reference to the live range being spilled), the allocator would spill the lowest priority neighbor of the spill site. The potential for recursively spilling already colored live ranges has led most implementors of top-down, priority-based allocators to reserve spill registers, instead.) The paradox, of course, is that reserving registers for spilling may cause spilling; not reserving those registers can force the allocator to iterate the entire allocation procedure.
Splitting the Live Range Spilling changes the coloring problem. The entire uncolored live range is broken into a series of tiny live ranges—so small that spilling them is counterproductive. A related way to change the problem is to take the uncolored live range and break it into pieces that are larger than a single reference. If these new live ranges interfere, individually, with fewer live ranges than the original live range, then the allocator may find colors for them. For example, if the new live ranges are unconstrained, colors must exist for them. This process, called live-range splitting, can lead to allocations that insert fewer loads and stores than would be needed to spill the entire live range.
The first top-down, priority-based coloring allocator broke the uncolored live range into single-block live ranges, counted interferences for each resulting live range, and then recombined live ranges from adjacent blocks when the combined live range remained unconstrained. It placed an arbitrary upper limit on the number of blocks that a split live range could span. Loads and stores were added at the starting and ending points of each split live range. The allocator spilled any split live ranges that remained uncolorable.
10.4.5Bottom-up Coloring
Bottom-up, graph-coloring register allocators use many of the same mechanisms as the top-down global allocators. These allocators discover live ranges, build an interference graph, attempt to color it, and generate spill code when needed. The major distinction between top-down and bottom-up allocators lies in the mechanism used to order live ranges for coloring. Where the top-down allocator uses high-level information to select an order for coloring, the bottom-up allocators compute an order from detailed structural knowledge about the interference graph, I. These allocators construct a linear ordering in which to consider the
274 |
CHAPTER 10. REGISTER ALLOCATION |
initialize stack
while (N = )
if n N with n◦ < k node ← n
else
node ← n picked from N remove node and its edges from I push node onto stack
Figure 10.8: Computing a bottom-up ordering
live ranges, and then assign colors in that order.
To order the live ranges, bottom-up, graph-coloring allocators rely on a familiar observation:
A live range with fewer than k neighbors must receive a color, independent of the assignment of colors to its neighbors.
The top-down allocators use this fact to partition the live ranges into constrained and unconstrained nodes. The bottom-up allocators use it to compute the order in which live ranges will be assigned colors, using the simple algorithm shown in Figure 10.8. The allocator repeatedly removes a node from the graph and places it on the stack. It uses two distinct mechanisms to select the node to remove next. The first clause takes a node that is unconstrained in the graph from which it is removed. The second clause, invoked only when every remaining node is constrained, picks a node using some external criteria. When the loop halts, the graph is empty and the stack contains all the nodes in order of removal.
To color the graph, the allocator rebuilds the interference graph in the order represented by the stack—the reverse of the order in which the allocator removed them from the graph. It repeatedly pops a node n from the stack, inserts n and its edges back into I, and looks for a color that works for n. At a high-level, the algorithm looks like:
while (stack = ) node ← pop(stack)
insert node and its edges into I color node
To color a node n, the allocator tallies the colors of n’s neighbors in the current approximation to I and assigns n an unused color. If no color remains for n, it is left uncolored.
When the stack is empty, I has been rebuilt. If every node received a color, the allocator declares success and rewrites the code, replacing live range names with physical registers. If any node remains uncolored, the allocator either spills the corresponding live range or splits it into smaller pieces. At this point, the classic bottom-up allocators rewrite the code to reflect the spills and splits, and
10.4. GLOBAL REGISTER ALLOCATION AND ASSIGNMENT |
275 |
repeat the entire process—finding live ranges, building I, and coloring it. The process repeats until every node in I receives a color. Typically, the allocator halts in a couple of iterations. Of course, a bottom-up allocator could reserve registers for spilling, as described with the top-down allocator. This would allow it to halt after a single pass.
Why does this work? The bottom-up allocator inserts each node back into the graph from which it was removed. Thus, if the node representing lri was removed from I because it was unconstrained at the time, it is re-inserted into an approximation to I where it is also unconstrained—and a color must exist for it. The only nodes that can be uncolored, then, are nodes removed from I using the spill metric in the second clause of Figure 10.8. These nodes are inserted into graphs where they have k or more neighbors. A color may exist for them. Assume that n◦ > k when the allocator inserts it into I. Those neighbors cannot all have distinct colors. They can have at most k colors. If they have precisely k colors, then the allocator finds no color for n. If, instead, they use one color, or k −1 colors, or any number between 1 and k −1, then the allocator discovers a color for n.
The removal process determines the order in which nodes are colored. This order is crucial, in that it determines whether or not colors are available. For nodes removed from the graph by virtue of having low degree in the current graph, the order is unimportant with respect to the remaining nodes. The order may be important with respect to nodes already on the stack; after all, the current node may have been constrained until some of the earlier nodes were removed. For nodes removed from the graph by the second criterion (“node picked from N ”), the order is crucial. This second criterion is invoked only when every remaining node has k or more neighbors. Thus, the remaining nodes form a heavily connected subset of I. The heuristic used to “pick” the node is often called the spill metric. The original bottom-up, graph-coloring allocator used a simple spill metric. It picked the node that minimized the fraction
estimated cost
current degree.
This picks a node that is relatively inexpensive to spill but lowers the degree of many other nodes. (Each remaining node has more than k −1 neighbors.) Other spill metrics have been tried, including minimizing estimated cost, minimizing the number of inserted operations, and maximizing removed edges. Since the actual coloring process is fast relative to building I, the allocator might try several colorings, each using a di erent spill metric, and retain the best result.
10.4.6Coalescing Live Ranges to Reduce Degree
A powerful coalescing phase can be built that uses the interference graph to determine when two live ranges that are connected by a copy can be coalesced, or combined. Consider the operation i2i lri lrj . If lri and lrj do not otherwise interfere, the operation can be eliminated and all references to lrj rewritten to use lri. This has several beneficial e ects. It directly eliminates the
276 |
|
|
CHAPTER 10. |
REGISTER ALLOCATION |
|
add lrt,lru |
lri |
add |
lrt,lru |
lrij |
|
|
... |
|
|
... |
|
i2i |
lri |
lrj |
|
|
lrk |
i2i |
lri |
lrk |
i2i |
lrij |
|
|
... |
|
|
... |
|
add |
lrj ,lrw |
lrx |
add |
lrij ,lrw |
lrx |
add |
lrk,lry |
lrz |
add |
lrk,lry |
lrz |
|
Before |
|
|
After |
|
|
Figure 10.9: Coalescing live ranges |
|
|||
copy operation, making the code smaller and, potentially, faster. It reduces the degree of any lrk that interfered with both lri and lrj . It shrinks the set of live ranges, making I and many of the data structures related to I smaller. Because these e ects help in allocation, coalescing is often done before the coloring stage in a global allocator.
Notice that forming lrij cannot increase the degree of any of its neighbors in I. It can decrease their degree, or leave it unchanged, but it cannot increase their degree.
Figure 10.9 shows an example. The relevant live ranges are lri, lrj , and lrk. In the original code, shown on the left, lrj is live at the definition of lrk, so they interfere. However, neither lrj nor lrk interfere with lri, so both of the copies are candidates for coalescing. The fragment on the right shows the result of coalescing lri and lrj to produce lrij .
Because coalescing two live ranges can prevent subsequent coalescing with other live ranges, order matters. In principle, the compiler should coalesce the most heavily executed copies first. In practice, allocators coalesce copies in order by the loop nesting depth of the block where the copy is found. Coalescing works from deeply nested to least deeply nested, on the theory that this gives highest priority to eliminating copy operations in innermost loops.
To perform coalescing, the allocator walks each block and examines any copy instructions. If the source and destination live ranges do not interfere, it combines them, eliminates the copy, and updates I to reflect the combination. The allocator can conservatively update I to reflect the change by moving all edges from the node for the destination live range to the node representing the source live range. This update, while not precise, allows the allocator to continue coalescing. In practice, allocators coalesce every live range allowed by I, then rewrite the code, rebuild I, and try again. The process typically halts after a couple of rounds of coalescing. It can produce significant reductions in the size of I. Briggs shows examples where coalescing eliminates up to one third of the live ranges.
Figure 10.9 also shows the conservative nature of the interference graph update. Coalescing lri and lrj into lrij actually eliminates an interference.
10.4. GLOBAL REGISTER ALLOCATION AND ASSIGNMENT |
277 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
||
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
find live |
- build |
|
|
coalesce |
spill |
|
find a |
|
|
|||
- |
|
|
|
|
|
|
|
||||||||
|
|
|
I |
|
|
|
|
costs |
|
coloring |
|
|
|||
|
|
|
ranges |
|
|
|
|
|
|
no spills |
|||||
|
|
|
|
|
|
|
|
|
|
|
|
|
|
||
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
? |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
insert |
- |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|||
|
|
|
|
spill and iterate |
|
|
|
spills |
reserved |
||||||
|
|
|
|
|
|
|
|
|
|
|
|
|
|
registers |
|
Figure 10.10: Structure of the coloring allocators
Careful scrutiny of the after fragment reveals that lrij does not interfere with lrk, since lrk is not live at the definition of lrij and the copy defining lrk introduces no interference between its source and its destination. Thus, rebuilding I from the transformed code reveals that, in fact, lrij and lrk can be coalesced. The conservative update left intact the interference between lrj and lrk , so it unnecessarily prevented the allocator from coalescing lrij and lrk.
10.4.7Review and Comparison
Both the top-down and the bottom-up coloring allocator work inside the same basic framework, shown in Figure 10.10. They find live ranges, build the interference graph, coalesce live ranges, compute spill costs on the coalesced version of the code, and attempt a coloring. The build-coalesce process is repeated until it finds no more opportunities. After coloring, one of two situations occurs. If every live range receives a color, then the code is rewritten using physical register names and allocation terminates. If some live ranges remain uncolored, then spill code is inserted.
If the allocator has reserved registers for spilling, then the allocator uses those registers in the spill code, rewrites the colored registers with their physical register names, and the process terminates. Otherwise, the allocator invents new virtual register names to use in spilling and inserts the necessary loads and stores to accomplish the spill. This changes the coloring problem slightly, so the entire allocation process is repeated on the transformed code. When all live ranges have a color, the allocator maps colors onto registers and rewrites the code into its final form.
Of course, a top-down allocator could adopt the spill-and-iterate philosophy used in the bottom-up allocator. This would eliminate the need to reserve registers for spilling. Similarly, a bottom-up allocator could reserve several registers for spilling and eliminate the need for iterating over the entire allocation process. Spill-and-iterate trades additional compile time for a tighter allocation, presumably using less spill code. Reserving registers produces a looser allocation with improved speed.
The top-down allocator uses its priority ranking to order all the constrained nodes. It colors the unconstrained nodes in arbitrary order, since the order cannot change the fact that they receive a color. The bottom-up allocator con-
278 |
CHAPTER 10. REGISTER ALLOCATION |
structs an order in which most nodes are colored in a graph where they are unconstrained. Every node that the top-down allocator classifies as unconstrained is colored by the bottom-up allocator, since it is unconstrained in the original version of I and in each graph derived by removing nodes and edges from I. The bottom-up allocator, using its incremental mechanism for removing nodes and edges, classifies as unconstrained some of the nodes that the top-down allocator treats as constrained. These nodes may also be colored in the top-down allocator; there is no clear way of comparing their performance on these nodes without coding up both algorithms and running them.
The truly hard-to-color nodes are those that the bottom-up allocator removes from the graph with its spill metric. The spill metric is only invoked when every remaining node is constrained. These nodes form a densely connected subset of I. In the top-down allocator, these nodes will be colored in an order determined by their rank or priority. In the bottom-up allocator, the spill metric uses that same ranking, moderated by a measurement of how many other nodes have their degree lowered by each choice. Thus, the top-down allocator chooses to spill low priority, constrained nodes, while the bottom-up allocator spills nodes that are still constrained after all unconstrained nodes have been removed. From this latter set, it picks the node that minimizes the spill metric.
10.4.8Other Improvements to Graph-coloring Allocation
Many variations on these two basic styles of graph-coloring register allocation have appeared in the literature. The first two address the compile-time speed of global allocation. The latter two address the quality of the resulting code.
Imprecise Interference Graphs The original top-down, priority-based allocator used an imprecise notion of interference: live ranges lri and lrj interfere if both are live in the same basic block. This necessitated a prepass that performed local allocation to handle values that are not live across a block boundary. The advantage of this scheme is that building the interference graph is faster. The weakness of the scheme lies in its imprecision. It overestimates the degree of some nodes. It also rules out using the interference graph as a basis for coalescing, since, by definition, two live ranges connected by a copy interfere. (They must be live in the same block if they are connected by a copy operation.)
Breaking the Graph into Smaller Pieces If the interference graph can be separated into components that are not connected, those disjoint components can be colored independently. Since the size of the bit-matrix is O(N 2), breaking it into independent components saves both space and time. One way to split the graph is to consider non-overlapping register classes separately, as with floating-point registers and integer registers. A more complex alternative for large codes is to discover clique separators that divide the interference graph into several disjoint pieces. For large enough graphs, using a hash-table instead of the bit-matrix may improve both speed and space, although the choice of a hash-function has a critical impact on both.
10.4. GLOBAL REGISTER ALLOCATION AND ASSIGNMENT |
279 |
Conservative Coalescing When the allocator coalesces two live ranges, lri and lrj , the new live range, lrij , can be more constrained than either lri or lrj .
If lri and lrj have distinct neighbors, then lr◦ij > max(lr◦i , lr◦j ). If lr◦ij < k, then creating lrij is strictly beneficial. However, if lr◦i < k and lr◦j < k,
but lr◦ij ≥ k, then coalescing lri and lrj can make I harder to color without spilling. To avoid this problem, some compiler writers have used a limited form of coalescing called conservative coalescing. In this scheme, the allocator only combines lri and lrj when lr◦ij < k. This ensures that coalescing lri and lrj does not make the interference graph harder to color.
If the allocator uses conservative coalescing, another improvement is possible. When the allocator reaches a point where every remaining live range is constrained, the basic algorithm selects a spill candidate. An alternative approach is to reapply coalescing at this point. Live ranges that were not coalesced because of the degree of the resulting live range may well coalesce in the reduced graph. Coalescing may reduce the degree of nodes that interfere with both the source and destination of the copy. Thus, this iterated coalescing can remove additional copies and reduce the degree of nodes. It may create one or more unconstrained nodes, and allow coloring to proceed. If it does not create any unconstrained nodes, spilling proceeds as before.
Spilling Partial Live Ranges As described, both global allocators spill entire live ranges. This can lead to overspilling if the demand for registers is low through most of the live range and high in a small region. More sophisticated spilling techniques can find the regions where spilling a live range is productive—that is, the spill frees a register in a region where the register is needed. The splitting scheme described for the top-down allocator achieved this result by considering each block in the spilled live range separately. In a bottom-up allocator, similar results can be achieved by spilling only in the region of interference. One technique, called interference region spilling, identifies a set of live ranges that interfere in the region of high demand and limits spilling to that region [9]. The allocator can estimate the cost of several spilling strategies for the interference region and compare those costs against the standard, spill-everywhere approach. By letting the alternatives compete on an estimated cost basis, the allocator can improve overall allocation.
Live Range Splitting Breaking a live range into pieces can improve the results of coloring-based register allocation. In principle, splitting harnesses two distinct e ects. If the split live ranges have lower degree than the original, they may be easier to color—possibly, unconstrained. If some of the split live ranges have high degree and, therefore, spill, then splitting may prevent spilling other pieces with lower degree. As a final, pragmatic e ect, splitting introduces spills at the points where the live range is broken. Careful selection of those split points can control the placement of some spill code—for example, outside loops rather than inside loops.
Many approaches to splitting have been tried. Section 10.4.4 described an approach that breaks the live range into blocks and coalesces them back to-
280 |
CHAPTER 10. REGISTER ALLOCATION |
gether when doing so does not change the allocator’s ability to assign a color. Several approaches that use properties of the control-flow graph to choose split points have been tried. Results from many have been inconsistent [13], however two particular techniques show promise. A method called zero-cost splitting capitalizes on holes in the instruction schedule to split live ranges and improve both allocation and scheduling [39]. A technique called passive splitting uses a directed interference graph to determine where splits should occur and selects between splitting and spilling based on the estimated cost of each [26].
10.5Regional Register Allocation
Even with global information, the global register allocators sometimes make decisions that result in poor code quality locally. To address these shortcomings, several techniques have appeared that are best described as regional allocators. They perform allocation and assignment over regions larger than a single block, but smaller than the entire program.
10.5.1Hierarchical Register Allocation
The register allocator implemented in the compiler for the first Tera computer uses a hierarchical strategy. It analyzes the control-flow graph to construct a hierarchy of regions, or tiles, that capture the control flow of the program. Tiles include loops and conditional constructs; the tiles form a tree that directly encodes nesting and hierarchy among the tiles.
The allocator handles each tile independently, starting with the leaves of the tile tree. Once a tile has been allocated, summary information about the allocation is available for use in performing allocation for its parent in the tile tree—the enclosing region of code. This hierarchical decomposition makes allocation and spilling decisions more local in nature. For example, a particular live range can reside in di erent registers in distinct tiles. The allocator inserts code to reconcile allocations and assignments at tile boundaries. A preferencing mechanism attempts to reduce the amount of reconciliation code that the allocator inserts.
This approach has two other benefits. First, the algorithm can accommodate customized allocators for di erent tiles. For example, a version of the bottomup local allocator might be used for long blocks, and an allocator designed for software pipelining applied to loops. The default allocator is a bottomup coloring allocator. The hierarchical approach irons out any rough edges between these diverse allocators. Second, the algorithm can take advantage of parallelism on the machine running the compiler. Unrelated tiles can be allocated in parallel; serial dependences arise from the parent-child relationship in the tile tree. On the Tera computer, this reduced the running time of the compiler.
The primary drawback of the hierarchical approach lies in the code generated at the boundaries between tiles. The preferencing mechanisms must eliminate copies and spills at those locations, or else the tiles must be chosen so that those locations execute less frequently than the code inside the tiles. The quality of
10.5. REGIONAL REGISTER ALLOCATION |
281 |
allocation with a hierarchical scheme depends heavily on what happens in these transitional regions.
10.5.2Probabilistic Register Allocation
The probabilistic approach tries to address the shortcomings of graph-coloring global allocators by trying to generalize the principles that underlying the bottom-up, local allocator. The bottom-up, local allocator selects values to spill based on the distance to their next use. This distance can be viewed as an approximation to the probability that the value will remain in a register until its next use; the value with the largest distance has the lowest probability of staying in a register. From this perspective, the bottom-up, local allocator always spills the value least likely to keep its register. (making it a self-fulfilling prophecy? ) The probabilistic technique uses this strong local technique to perform an initial allocation. It then uses a combination of probabilities and estimated benefits to make inter-block allocation decisions.
The global allocation phase of the probabilistic allocator breaks the code into regions that reflect its loop structure. It estimates both the expected benefit from keeping a live range in a register and the probability that the live range will stay in a register throughout the region. The allocator computes a merit rank for each live range as the product of its expected benefit and its global probability. It allocates a register for the highest ranking live range, adjusts probabilities for other live ranges to reflect this decision, and repeats the process. (Allocating lri to a register decreases the probability that conflicting live ranges can receive registers.) Allocation proceeds from inner loops to outer loops, and iterates until no uses remain in the region with probability greater than zero.
Once all allocation decisions have been made, it uses graph coloring to perform assignment. It makes a clear separation of allocation from assignment; allocation is performed using the bottom-up local method, followed by a probabilistic, inter-block analog for global allocation decisions.
10.5.3Register Allocation via Fusion
The fusion-based register allocator presents another model for using regional information to drive global register allocation. The fusion allocator partitions the code into regions, constructs interference graphs for each region, and then fuses regions together to form the global interference graph. A region can be a single block, an arbitrary set of connected blocks, a loop, or an entire function. Regions are connected by control-flow edges.
The first step in the fusion allocator, after region formation, ensures that the interference graph for each region is k-colorable. To accomplish this, the allocator may spill some values inside the region. The second step merges the disjoint interference graphs for the code to form a single global interference graph. This is the critical step in the fusion allocator, because the fusion operator maintains k-colorability. When two regions are fused, the allocator may need to split some live ranges to maintain k-colorability. It relies on the observation that only values live along an edge joining the two regions a ect the new region’s
282 CHAPTER 10. REGISTER ALLOCATION
colorability. To order the fusion operations, it relies on a priority ordering of the edges determined when regions are formed. The final step of the fusion allocator assigns a physical register to each allocated live range. Because the graph-merging operator maintains colorability, the allocator can use any of the standard graph-coloring techniques.
The strength of the fusion allocator lies in the fusion operator. By only splitting live ranges that are live across the edge or edges being combined, it localizes the inter-region allocation decisions. Because fusion is applied in edgepriority order, the code introduced by splitting tends to occur on low-priority edges. To the extent that priorities reflect execution frequency, this should lead to executing fewer allocator-inserted instructions.
Clearly, the critical problem for a fusion-based allocator is region formation. If the regions reflect execution frequencies—that is, group together heavily executed blocks and separate out blocks that execute infrequently, then it can force spilling into those lower frequency blocks. If the regions are connected by edges across which few values are live, the set of instructions introduced for splitting can be kept small. However, if the chosen regions fail to capture some of these properties, the rationalization for the fusion-based approach breaks down.
10.6Harder Problems
This chapter has presented a selection of algorithms that attack problems in register allocation. It has not, however, closed the book on allocation. Many harder problems remain; there is room for improvement on several fronts.
10.6.1Whole-program Allocation
The algorithms presented in this chapter all consider register allocation within the context of a single procedure. Of course, whole programs are built from multiple procedures. If global allocation produces better results than local allocation by considering a larger scope, then should the compiler writer consider performing allocation across entire programs? Just as the problem changes significantly when allocation moves from a single block to an entire procedure, it changes in important ways when allocation moves from single procedures to entire programs. Any whole-program allocation scheme must deal with the following issues.
To perform whole-program allocation, the allocator must have access to the code for the entire program. In practice, this means performing whole-program allocation at link-time. (While whole-program analyses and transformations can be applied before link-time, many issues that complicate an implementation of such techniques can be e ectively side-stepped by performing them at link-time.)
To perform whole-program allocation, the compiler must have accurate estimates of the relative execution frequencies of all parts of the program. Within a procedure, static estimates (such as, “a loop executes 10 times”) have proven to be reasonable approximations to actual behavior. Across the entire program, this may no longer be true.
In moving from a single procedure to a scope that includes multiple proce-
