Cooper, K. Engineering a Compiler
11.5. MORE AGGRESSIVE TECHNIQUES

[Figure 11.11: two control-flow graphs, labeled "Original Code" and "After Cloning," showing blocks B1 and B2 between entry and exit nodes.]
Figure 11.11: Cloning a Tail-call
Cloning must be constrained to avoid excessive code growth and to avoid unwinding loops. A typical implementation clones blocks within an innermost loop, stopping when it reaches a loop-closing edge (sometimes called a back edge in the control-flow graph). This creates a situation in which the only multiple-entry block in the loop is the first block in the loop; all other paths contain only single-entry blocks.
A second example that merits consideration arises in tail-recursive programs. Recall, from Section 8.8.2, that a program is tail recursive if its last action is a recursive self-invocation. When the compiler detects a tail-call, it can convert the call into a branch back to the procedure’s entry. This implements the recursion with an ad hoc looping construct rather than a full-weight procedure call. From the scheduler’s point of view, however, cloning may improve the situation.
The left side of Figure 11.11 shows an abstracted control-flow graph for a tail-recursive routine, after the tail-call has been converted to iteration. Block B1 is entered along two paths, the path from the procedure entry and the path from B2. This forces the scheduler to use worst-case assumptions about what precedes B1. By cloning B1, as shown on the right, the compiler can create a situation in which control enters each copy of B1 along only one edge. This may improve the results of regional scheduling with either an ebb scheduler or a loop scheduler. To further simplify the situation, the compiler might coalesce the cloned B1 onto the end of B2, creating a single-block loop body. The resulting loop can be scheduled with either a local scheduler or a loop scheduler, as appropriate.
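As a concrete sketch of this transformation, the loop might look like the following in iloc. The labels, the test register r_done, and the elided operations (written ...) are invented for illustration; the important point is the number of entry edges into each block.

```
// Original: the tail-call has become a branch, so l_B1
// has two entry edges, one from entry and one from l_B2.
entry:  ...
        jumpI → l_B1
l_B1:   ...                        // loop header
        cbr   r_done → l_exit, l_B2
l_B2:   ...                        // recursive case
        jumpI → l_B1               // the converted tail-call

// After cloning: l_B2 branches to its own private copy of B1,
// so each copy of B1 now has exactly one entry edge.
entry:  ...
        jumpI → l_B1
l_B1:   ...
        cbr   r_done → l_exit, l_B2
l_B2:   ...
        jumpI → l_B1c
l_B1c:  ...                        // clone of B1; coalescing it onto
        cbr   r_done → l_exit, l_B2  // the end of l_B2 yields a
                                     // single-block loop body
```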
11.5.2 Global Scheduling
Of course, the compiler could attempt to take a global approach to scheduling, just as compilers take a global approach to register allocation. Global scheduling schemes are a cross between code motion, performed quite late, and regional scheduling to handle the details of instruction ordering. They require a global graph to represent precedence; it must account for the flow of data along all possible control-flow paths in the program.
These algorithms typically determine, for each operation, the earliest position in the control-flow graph that is consistent with the constraints represented in the precedence graph. Given that location, they may move the operation later in the control-flow graph, to the deepest location that is controlled by the same set of conditions. The former heuristic is intended to move operations out of deeply nested loops and into less frequently executed positions. Moving the operation earlier in the control-flow graph often increases, at least temporarily, the size of its live range. That leads to the latter heuristic, which tries to move the operation as close as possible to its subsequent use without moving it any deeper in the nesting of the control-flow graph.

Digression: Measuring Run-time Performance

The primary goal of instruction scheduling is to improve the running time of the generated code. Discussions of performance use many different metrics; the two most common are:

Operations per second  The metric commonly used to advertise computers and to compare system performance is the number of operations executed in a second. This can be measured as instructions issued per second or instructions retired per second.

Time to complete a fixed task  This metric uses one or more programs whose behavior is known, and compares the time required to complete these fixed tasks. This approach, called benchmarking, provides information about overall system performance, both hardware and software, on a particular workload.

No single metric contains enough information to allow evaluation of the quality of code generated by the compiler’s back end. For example, if the measure is operations per second, does the compiler get extra credit for leaving extraneous (but independent) operations in the code? The simple timing metric provides no information about what is achievable for the program. Thus, it allows one compiler to do better than another, but fails to show the distance between the generated code and what is optimal for that code on the target machine.

Numbers that the compiler writer might want to measure include the percentage of executed instructions whose output is actually used, and the percentage of cycles spent in stalls and interlocks. The former gives insight into some aspects of predicated execution, while the latter directly measures some aspects of schedule quality.
The compiler writer may be able to achieve similar results in a simpler way—using a specialized algorithm to perform code motion (such as Lazy Code Motion, described in Chapter 14) followed by a strong local or regional scheduler. The arguments for global optimization and for global allocation may not carry over to scheduling; the emphasis on scheduling is to avoid stalls, interlocks, and nops. These latter issues tend to be localized in their impact.
11.6 Summary and Perspective
Algorithms that guarantee optimal schedules exist for simplified situations. For example, on a machine with one functional unit and uniform operation latencies, the Sethi-Ullman labelling algorithm creates an optimal schedule for an expression tree [47]. It can be adapted to produce good code for expression dags. Fischer and Proebsting built on the labelling algorithm to derive an algorithm that produces optimal or near-optimal results for small memory latencies [43]. Unfortunately, it has trouble when either the number of functional units or their latencies rises.
In practice, modern computers have become complex enough that none of the simplified models adequately reflect their behavior. Thus, most compilers use some form of list scheduling. The algorithm is easily adapted and parameterized; it can be run for forward scheduling and backward scheduling. The technology of list scheduling is the base from which more complex schedulers, like software pipeliners, are built.
Techniques that operate over larger regions have grown up in response to real problems. Trace scheduling was developed for vliw architectures, where the compiler needed to keep many functional units busy. Techniques that schedule extended basic blocks and loops are, in essence, responses to the increase in both the number of pipelines that the compiler must consider and their individual latencies. As machines have become more complex, schedulers have needed a larger scheduling context to discover enough instruction-level parallelism to keep the machines busy.
The example for backward versus forward scheduling in Figure 11.4 was brought to our attention by Philip Schielke [46]. It is from the Spec benchmark program go. It captures, concisely, an effect that has caused many compiler writers to include both forward and backward schedulers in their back ends.
APPENDIX A. ILOC
operation      →  opcode operand-list ⇒ operand-list

operand-list   →  operand
               |  operand , operand-list

operand        →  register
               |  number
               |  label
Operands come in three types: register, number, and label. The type of each operand is determined by the opcode and the position of the operand in the operation. In the examples, we make this textually obvious by beginning all register operands with the letter r and all labels with a letter other than r (typically, with an l).
We assume that source operands are read at the beginning of the cycle when the operation issues and that target operands are defined at the end of the cycle in which the operation completes.
Most operations have a single target operand; some of the store operations have multiple target operands. For example, the storeAI operation has a single source operand and two target operands. The source must be a register, and the targets must be a register and an immediate constant. Thus, the iloc operation
storeAI ri ⇒ rj, 4
computes an address by adding 4 to the contents of rj and stores the value found in ri into the memory location specified by the address. In other words,
Memory(rj +4) ← Contents(ri)
The non-terminal opcode can be any of the iloc operation codes. Unfortunately, as in a real assembly language, the relationship between an opcode and the form of its arguments is less than systematic. The easiest way to specify the form of each opcode is in a tabular form. Figure A.2 at the end of this appendix shows the number of operands and their types for each iloc opcode used in the book.
As a lexical matter, iloc comments begin with the string // and continue until the end of a line. We assume that these are stripped out by the scanner; thus they can occur anywhere in an instruction and are not mentioned in the grammar.
To make this discussion more concrete, let’s work through the example used in Chapter 1. It is shown in Figure A.1. To start, notice the comments on the right edge of most lines. In our iloc-based systems, comments are automatically generated by the compiler’s front end, to make the iloc code more readable by humans. Since the examples in this book are intended primarily for humans, we continue this tradition of annotating the iloc code.
This example assumes that register rsp holds the address of a region in memory where the variables w, x, y, and z are stored at offsets 0, 8, 16, and 24, respectively.
loadAI   rsp, 0   ⇒ rw    // w is at offset 0 from rsp
loadI    2        ⇒ r2    // constant 2 into r2
loadAI   rsp, 8   ⇒ rx    // x is at offset 8
loadAI   rsp, 16  ⇒ ry    // y is at offset 16
loadAI   rsp, 24  ⇒ rz    // z is at offset 24
mult     rw, r2   ⇒ rw    // rw ← w × 2
mult     rw, rx   ⇒ rw    // rw ← (w × 2) × x
mult     rw, ry   ⇒ rw    // rw ← (w × 2 × x) × y
mult     rw, rz   ⇒ rw    // rw ← (w × 2 × x × y) × z
storeAI  rw       ⇒ rsp, 0   // write rw back to ’w’
Figure A.1: Introductory example, revisited
The first instruction is a loadAI operation, or a load address-immediate. From the opcode table, we can see that it combines the contents of rsp with the immediate constant 0 and retrieves the value found in memory at that address. We know, from above, that this value is w. It stores the retrieved value in rw . The next instruction is a loadI operation, or a load immediate. It moves the value 2 directly into r2. (Effectively, it reads a constant out of the instruction stream and into a register.) Instructions three through five load the values of x into rx, y into ry , and z into rz .
The sixth instruction multiplies the contents of rw and r2, storing the result back into rw. Instruction seven multiplies this quantity by rx. Instruction eight multiplies in ry , and instruction nine picks up rz . In each instruction from six through nine, the value is accumulated into rw .
Finally, instruction ten saves the value to memory. It uses a storeAI, or store address-immediate, to write the contents of rw into the memory location at offset 0 from rsp. As pointed out in Chapter 1, this sequence evaluates the expression
w ← w × 2 × x × y × z
The opcode table, at the end of this appendix, lists all of the opcodes used in iloc examples in the book.
Comparison and Conditional Branch  In general, the iloc comparison operators take two values and return a boolean value. The operations cmp_LT, cmp_LE, cmp_EQ, cmp_NE, cmp_GE, and cmp_GT work this way. The corresponding conditional branch, cbr, takes a boolean as its argument and transfers control to one of two target labels. The first label is selected if the boolean is true; the second label is selected if the boolean is false.
Using two labels on the conditional branch has two advantages. First, the code is somewhat more concise. In several situations, a conditional branch might be followed by an absolute branch. The two-label branch lets us record that combination in a single operation. Second, the code is easier to manipulate. A single-label conditional branch implies some positional relationship with the next instruction; there is an implicit connection between the branch and its
“fall-through” path. The compiler must take care, particularly when reading and writing linear code, to preserve these relationships. The two-label conditional branch makes this implicit connection explicit, and removes any possible positional dependence.
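A small example shows the two-label form in use; the labels here are invented for illustration:

```
        cmp_LT r1, r2 ⇒ r3          // r3 ← (r1 < r2)
        cbr    r3 → l_then, l_else  // both targets explicit;
                                    // no implicit fall-through path
l_then: ...
        jumpI  → l_merge
l_else: ...
        jumpI  → l_merge
l_merge: ...
```

Because both targets are named, either block can be moved or the branch reordered without breaking a positional assumption.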
Because the two branch targets are not “defined” by the instruction, we change the syntax slightly. Rather than use the ⇒ arrow, we write branches with the smaller → arrow.
In a few places, we want to discuss what happens when the comparison writes a complex value into a designated area, a “condition code.” The condition code is a multi-bit value that can be interpreted only by a more complex conditional branch instruction. To talk about this mechanism, we use an alternate set of comparison and conditional branch operators. The comparison operator, comp, takes two values and sets the condition code appropriately. We always designate the target of comp as a condition-code register by writing it cci. The corresponding conditional branch has six variants, one for each comparison result. Figure A.3 shows these instructions and their meanings.
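In this alternate scheme, the test from the earlier example might be written as follows. The labels are again invented, and cbr_LT stands for the less-than variant of the condition-code branch from Figure A.3:

```
        comp   r1, r2 ⇒ cc1          // cc1 encodes <, =, or >
        cbr_LT cc1 → l_then, l_else  // taken if r1 < r2
```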
Note: In a real compiler that used iloc, we would need to introduce some representation for distinct data types. The research compiler that we built using iloc had several distinct data types—integer, single-precision floating-point, double-precision floating-point, complex, and pointer.
A.0.1 Naming Conventions
1. Memory offsets for variables are represented symbolically by prefixing the variable name with the @ character.
2. The user can assume an unlimited supply of registers. These are named with simple integers, as in r1776.
3. The register r0 is reserved as a pointer to the current activation record. Thus, the operation loadAI r0, @x ⇒ r1 implicitly exposes the fact that the variable x is stored in the activation record of the procedure containing the operation.
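Taken together, these conventions make short examples self-describing. For instance, a fragment that increments the local variable x might read (the registers chosen are arbitrary):

```
loadAI  r0, @x ⇒ r1      // r1 ← x, from the current activation record
addI    r1, 1  ⇒ r2      // r2 ← x + 1
storeAI r2     ⇒ r0, @x  // x ← x + 1
```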
A.0.2 Other Important Points
An iloc operation reads its operands at the start of the cycle in which it issues. It writes its target at the end of the cycle in which it finishes. Thus, two operations in the same instruction can both refer to a given register. Any uses receive the value defined at the start of the cycle. Any definitions occur at the end of the cycle. This is particularly important in Figure 11.9.
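For example, on a machine that issues two operations per instruction, a single instruction can exchange two registers, because each copy reads the value that its source held at the start of the cycle. Here i2i is the iloc register-to-register copy, and the brackets group the two operations of one instruction (the grouping notation is illustrative, not part of the iloc grammar):

```
[ i2i r1 ⇒ r2 ; i2i r2 ⇒ r1 ]   // after this cycle, r1 and r2 are swapped
```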
