Digression: Short-circuit evaluation as an optimization
Short-circuit evaluation arose naturally from a positional encoding of the value of boolean and relational expressions. On processors that used condition codes to record the result of a comparison and used conditional branches to interpret the condition code, short-circuiting made sense.
As processors include features like conditional move, boolean-valued comparisons, and predicated execution, the advantages of short-circuit evaluation will likely fade. With branch latencies growing, the cost of the conditional branches required for short-circuiting will grow. When the branch costs exceed the savings from avoiding evaluation, short circuiting will no longer be an improvement. Instead, full evaluation would be faster.
When the language requires short-circuit evaluation, as does c, the compiler may need to perform some analysis to determine when it is safe to substitute full evaluation for short-circuiting. Thus, future c compilers may include analysis and transformation to replace short-circuiting with full evaluation, just as compilers in the past have performed analysis and transformation to replace full evaluation with short circuiting.
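To make the safety question concrete, here is a small sketch in C (the function names are illustrative, not from the text): C's && must short-circuit, so a compiler may substitute full evaluation only when the right-hand operand can neither fault nor produce a side effect.

    int both_less(int a, int b, int c, int d) {
        /* The right operand is a side-effect-free comparison that cannot
           fault; evaluating c < d even when a >= b is unobservable, so a
           compiler could safely use full, branch-free evaluation here.   */
        return a < b && c < d;
    }

    int first_matches(int *p, int n) {
        /* The right operand dereferences p; the short-circuit form never
           touches a null p, so substituting full evaluation could
           introduce a fault that the source program does not contain.    */
        return p != 0 && p[0] == n;
    }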
Consider the conditional move operation i2i_< cci, rt, rf ⇒ r1, where rt is known to contain true and rf is known to contain false. The effect of this instruction is to set r1 to true if the condition code register cci has the value <, and to false otherwise.
The conditional move instruction executes in a single cycle. At compile time, it does not break a basic block; this can improve the quality of code produced by local optimization. At execution time, it does not disrupt the hardware mechanisms that prefetch and decode instructions; this avoids potential stalls due to mispredicted branches.
Boolean-valued Comparisons   This scheme avoids the condition code entirely. The comparison operation writes a boolean value into either a general purpose register or a dedicated, single-bit register. The conditional branch takes that result as an argument that determines its behavior.
The strength of this model lies in the uniform representation of boolean and relational values. The compiler never emits an instruction to convert the result of a comparison into a boolean value. It never executes a branch as part of evaluating a relational expression, with all the advantages ascribed earlier to the same aspect of conditional move.
The weakness of this model is that it requires explicit comparisons. Where the condition-code models can often avoid the comparison by arranging to have the condition code set by one of the arithmetic operations, this model requires the comparison instruction. This might make the code longer than under the condition-code models. However, the compiler does not need to have true and false in registers. (Getting them into registers might require one or two loadIs.)
Straight Condition Codes                    Conditional Move

      comp    ra, rb   ⇒ cc1                comp    ra, rb          ⇒ cc1
      cbr_LT  cc1      → L1, L2             add     rc, rd          ⇒ rt1
L1:   add     rc, rd   ⇒ ra                 add     re, rf          ⇒ rt2
      br               → Lout               i2i_<   cc1, rt1, rt2   ⇒ ra
L2:   add     re, rf   ⇒ ra
Lout: nop

Boolean Compare                             Predicated Execution

      cmp_LT  ra, rb   ⇒ r1                 cmp_LT  ra, rb       ⇒ r1
      cbr     r1       → L1, L2             (r1)?   add rc, rd   ⇒ ra
L1:   add     rc, rd   ⇒ ra                 (¬r1)?  add re, rf   ⇒ ra
      br               → Lout
L2:   add     re, rf   ⇒ ra
Lout: nop

Figure 8.4: Relational Expressions for Control-Flow

Predicated Execution   The architecture may combine boolean-valued comparisons with a mechanism for making some, or all, operations conditional. This
technique, called predicated execution, lets the compiler generate code that avoids using conditional branches to evaluate relational expressions. In iloc, we write a predicated instruction by including a predicate expression before the instruction. To remind the reader of the predicate’s purpose, we typeset it in parentheses and follow it with a question mark. For example,
(r17)? add ra, rb ⇒ rc
indicates an add operation (ra+rb) that executes if and only if r17 contains the value true. (Some architects have proposed machines that always execute the operation, but only assign the result to the target register if the predicate is true. As long as the “idle” instruction does not raise an exception, the differences between these two approaches are irrelevant to our discussion.) To expose the complexity of predicate expressions in the text, we will allow boolean expressions over registers in the predicate field. Actual hardware implementations will likely require a single register. Converting our examples to such a form requires the insertion of some additional boolean operations to evaluate the predicate expression into a single register.
8.4.3 Choosing a Representation
The compiler writer must decide when to use each of these representations. The decision depends on hardware support for relational comparisons, the costs of branching (particularly a mispredicted conditional branch), the desirability of short-circuit evaluation, and how the result is used by the surrounding code.
Consider the following code fragment, where the sole use for (a < b) is to alter control-flow in an if–then–else construct.
if (a < b)
   then a ← c + d
   else a ← e + f

Straight Condition Codes                    Conditional Move

      comp    ra, rb   ⇒ cc1                comp    ra, rb        ⇒ cc1
      cbr_LT  cc1      → L1, L2             i2i_<   cc1, rt, rf   ⇒ r1
L1:   comp    rc, rd   ⇒ cc2                comp    rc, rd        ⇒ cc2
      cbr_LT  cc2      → L3, L2             i2i_<   cc2, rt, rf   ⇒ r2
L2:   loadI   false    ⇒ rx                 and     r1, r2        ⇒ rx
      br               → L4
L3:   loadI   true     ⇒ rx
      br               → L4
L4:   nop

Boolean Compare                             Predicated Execution

      cmp_LT  ra, rb   ⇒ r1                 cmp_LT  ra, rb   ⇒ r1
      cmp_LT  rc, rd   ⇒ r2                 cmp_LT  rc, rd   ⇒ r2
      and     r1, r2   ⇒ rx                 and     r1, r2   ⇒ rx

Figure 8.5: Relational Expressions for Assignment
Figure 8.4 shows the code that might be generated under each hardware model. The two examples on the left use conditional branches to implement the if-then-else. Each takes five instructions. The examples on the right avoid branches in favor of some form of conditional execution. The two examples on top use an implicit representation; the value of a < b exists only in cc1, which is not a general purpose register. The bottom two examples create an explicit boolean representation for a < b in r1. The left two examples use the value, implicit or explicit, to control a branch, while the right two examples use the value to control an assignment.
As a second example, consider the assignment x ← a < b ∧ c < d. It appears to be a natural for a numerical representation, because it uses ∧ and because the result is stored into a variable. (Assigning the result of a boolean or relational expression to a variable necessitates a numerical representation, at least as the final product of evaluation.) Figure 8.5 shows the code that might result under each of the four models.
Again, the upper two examples use condition codes to record the result of a comparison, while the lower two use boolean values stored in a register. The left side shows the simpler version of the scheme, while the right side adds a form of conditional operation. The bottom two code fragments are shortest; they are identical because predication has no direct use in the chosen assignment. Conditional move produces shorter code than the straight condition code scheme. Presumably, the branches are slower than the comparisons, so the code is faster, too. Only the straight condition code scheme performs short-circuit evaluation.
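A source-level footnote may help connect Figure 8.5 to language rules. In a C sketch (illustrative function names, not drawn from the text), the operator chosen determines whether short-circuiting is even permitted:

    int x_shortcircuit(int a, int b, int c, int d) {
        /* && requires short-circuit evaluation; only a scheme that
           branches, or that proves full evaluation safe, implements it. */
        return a < b && c < d;
    }

    int x_full(int a, int b, int c, int d) {
        /* & evaluates both comparisons; it maps naturally onto the
           branch-free fragments of Figure 8.5.                          */
        return (a < b) & (c < d);
    }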
8.5 Storing and Accessing Arrays
So far, we have assumed that variables stored in memory are scalar values. Many interesting programs use arrays or similar structures. The code required to locate and reference an element of an array is surprisingly complex. This section shows several schemes for laying out arrays in memory and describes the code that each scheme produces for an array reference.
8.5.1 Referencing a Vector Element
The simplest form of an array has a single dimension; we call a one-dimensional array a vector. Vectors are typically stored in contiguous memory, so that the ith element immediately precedes the i + 1st element. Thus, a vector V[3..10] generates the following memory layout.
V[3..10]:    3 | 4 | 5 | 6 | 7 | 8 | 9 | 10
             ↑
            @V
When the compiler encounters a reference, like V[6], it must use the index into the vector, along with facts available from the declaration of V, to generate an offset for V[6]. The actual address is then computed as the sum of the offset and a pointer to the start of V, which we write as @V.
As an example, assume that V has been declared as V[low..high], where low and high are the lower and upper bounds on the vector. To translate the reference V[i], the compiler needs both a pointer to the start of storage for V and the offset of element i within V. The offset is simply (i − low) × w, where w is the length of a single element of V. Thus, if low is 3 and i is 6, the offset is (6 − 3) × 4 = 12. The following code fragment computes the address of V[i] into r4 and loads its value into rv.
    load    @i      ⇒ r1     // @i is i's address
    subI    r1, 3   ⇒ r2     // (offset - lower bound)
    multI   r2, 4   ⇒ r3     // × element length
    addI    r3, @V  ⇒ r4     // @V is V's address
    load    r4      ⇒ rv
Notice that the textually simple reference V[i] introduces three arithmetic operations. These can be simplified. Forcing a lower bound of zero eliminates the subtraction; by default, vectors in c have zero as their lower bound. If the element length is a power of two, the multiply can be replaced with an arithmetic shift; most element lengths have this property. Adding the address and offset seems unavoidable; perhaps this explains why most processors include an addressing mode that takes a base address and an offset and accesses the location at base address + offset.² We will write this as loadAO in our examples. Thus, there are obvious ways of improving the last two operations.
² Since the compiler cannot eliminate the addition, it has been folded into hardware.
If the lower bound for an array is known at compile-time, the compiler can fold the adjustment for the vector’s lower bound into its address. Rather than letting @V point directly to the start of storage for V, the compiler can use @V0, computed as @V − low × w. In memory, this produces the following layout.
V[3..10]:                3 | 4 | 5 | 6 | 7 | 8 | 9 | 10
              ↑          ↑
             @V0        @V
We sometimes call @V0 the “false zero” of V. If the bounds are not known at compile-time, the compiler might calculate @V0 as part of its initialization activity and reuse that value in each reference to V. If each call to the procedure executes one or more references to V, this strategy is worth considering.
Using the false zero, the code for accessing V[i] simplifies to the following sequence:
    loadI    @V0       ⇒ r@V    // adjusted address for V
    load     @i        ⇒ r1     // @i is i's address
    lshiftI  r1, 2     ⇒ r2     // × element length
    loadAO   r@V, r2   ⇒ rV
This eliminates the subtraction of low. Since the element length, w, is a power of two, we also replaced the multiply with a shift. More context might produce additional improvements. If either V or i appears in the surrounding code, then @V0 and i may already reside in registers. This would eliminate one or both of the load operations, further shortening the instruction sequence.
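The false-zero adjustment is pure address arithmetic, so it can be demonstrated outside the compiler. The following C sketch (illustrative names; the bias is kept as a byte offset and combined with the index offset before any pointer is formed) mimics @V0 = @V − low × w for V[3..10]:

    #include <stdio.h>

    int main(void) {
        int storage[8] = {30, 40, 50, 60, 70, 80, 90, 100};   /* V[3..10] */
        int low = 3, w = (int)sizeof(int);
        long falseZeroBias = -(long)low * w;     /* @V0 = @V + falseZeroBias */

        int i = 6;
        const char *base = (const char *)storage;              /* @V */
        /* @V0 + i*w  ==  @V + (i - low)*w; the offsets are summed first so
           that no out-of-range intermediate pointer is ever formed.       */
        int value = *(const int *)(base + (falseZeroBias + (long)i * w));

        printf("V[%d] = %d\n", i, value);        /* prints V[6] = 60 */
        return 0;
    }

On a machine with 4-byte integers, the combined offset is the 12-byte offset computed above.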
8.5.2 Array Storage Layout
Accessing a multi-dimensional array requires more work. Before discussing the code sequences that the compiler must generate, we must consider how the compiler will map array indices into memory locations. Most implementations use one of three schemes: row-major order, column-major order, or indirection vectors. The source language definition usually specifies one of these mappings.
The code required to access an array element depends on the way that the array is mapped into memory. Consider the array A[1..2,1..4]. Conceptually, it looks like
A:    1,1   1,2   1,3   1,4
      2,1   2,2   2,3   2,4
In linear algebra, the row of a two-dimensional matrix is its first dimension, and the column is its second dimension. In row-major order, the elements of A are mapped onto consecutive memory locations so that adjacent elements of a single row occupy consecutive memory locations. This produces the following layout.
1,1   1,2   1,3   1,4   2,1   2,2   2,3   2,4
The following loop nest shows the effect of row-major order on memory access patterns.
for i ← 1 to 2
    for j ← 1 to 4
        A[i,j] ← A[i,j] + 1
In row-major order, the assignment statement steps through memory in sequential order, beginning with A[1,1] and ending with A[2,4]. This kind of sequential access works well with most memory hierarchies. Moving the i loop inside the j loop produces an access sequence that jumps between rows, accessing A[1,1], A[2,1], A[1,2], . . . , A[2,4]. With a small array like A, this is not a problem. With larger arrays, the lack of sequential access could produce poor performance in the memory hierarchy. As a general rule, row-major order produces sequential access when the outermost (rightmost) subscript varies fastest.
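Because C stores arrays in row-major order, the loop-ordering effect is easy to see in source form. A small sketch (sizes and names illustrative; C's zero-based indexing stands in for A[1..2,1..4]):

    enum { ROWS = 2, COLS = 4 };

    void sweep(double A[ROWS][COLS]) {
        /* Sequential: the rightmost subscript varies fastest, so
           successive iterations touch adjacent memory locations.    */
        for (int i = 0; i < ROWS; i++)
            for (int j = 0; j < COLS; j++)
                A[i][j] += 1.0;

        /* Strided: successive iterations are COLS elements apart,
           which interacts poorly with the memory hierarchy.         */
        for (int j = 0; j < COLS; j++)
            for (int i = 0; i < ROWS; i++)
                A[i][j] += 1.0;
    }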
The obvious alternative to row-major order is column-major order. It keeps the columns of A in contiguous locations, producing the following layout.
1,1   2,1   1,2   2,2   1,3   2,3   1,4   2,4
Column-major order produces sequential access when the innermost (leftmost) subscript varies fastest. In our doubly-nested loop, moving the i loop to the innermost position produces sequential access, while having the j loop inside the i loop produces non-sequential access.
A third alternative, not quite as obvious, has been used in several languages. This scheme uses indirection vectors to reduce all multi-dimensional arrays to a set of vectors. For our array A, this would produce
A ──→ •──→ 1,1   1,2   1,3   1,4
      •──→ 2,1   2,2   2,3   2,4
Each row has its own contiguous storage. Within a row, elements are addressed as in a vector (see Section 8.5.1). To allow systematic addressing of the row vectors, the compiler allocates a vector of pointers and initializes it appropriately.
This scheme appears simple, but it introduces two kinds of complexity. First, it requires more storage than the simpler row-major or column-major layouts. Each array element has a storage location; additionally, the inner dimensions require indirection vectors, and the number of vectors grows with the product of the sizes of those inner dimensions. Figure 8.6 shows the layout for a more complex array, B[1..2,1..3,1..4]. Second, a fair amount of initialization code is required to set up all the pointers for the array's inner dimensions.
Each of these schemes has been used in a popular programming language. For languages that store arrays in contiguous storage, row-major order has been the typical choice; the one notable exception is Fortran, which used column-major order. Both bcpl and c use indirection vectors; c sidesteps the initialization issue by requiring the programmer to explicitly fill in all of the pointers.
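For illustration, the usual C idiom builds exactly this structure by hand (names illustrative; error checking omitted):

    #include <stdlib.h>

    double **make_A(void) {
        double **A = malloc(2 * sizeof *A);      /* vector of two row pointers */
        for (int i = 0; i < 2; i++)
            A[i] = calloc(4, sizeof **A);        /* one data vector per row    */
        return A;                                /* A[i][j] follows two pointers */
    }

After this setup, a reference A[i][j] reaches an element by following two pointers rather than by evaluating an address polynomial.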
B ──→ •──→ •──→ 1,1,1   1,1,2   1,1,3   1,1,4
      │    •──→ 1,2,1   1,2,2   1,2,3   1,2,4
      │    •──→ 1,3,1   1,3,2   1,3,3   1,3,4
      │
      •──→ •──→ 2,1,1   2,1,2   2,1,3   2,1,4
           •──→ 2,2,1   2,2,2   2,2,3   2,2,4
           •──→ 2,3,1   2,3,2   2,3,3   2,3,4

Figure 8.6: Indirection vectors for B[1..2,1..3,1..4]
8.5.3 Referencing an Array Element
Computing an address for a multi-dimensional array requires more work. It also requires a commitment to one of the three storage schemes described in Section 8.5.2.
Row-major Order   In row-major order, the address calculation must find the start of the row and then generate an offset within the row as if it were a vector. Recall our example of A[1..2,1..4]. To access element A[i,j], the compiler must emit code that computes the address of row i, and follow that with the offset for element j, which we know from Section 8.5.1 will be (j − low2) × w. Each row contains 4 elements, computed as high2 − low2 + 1, where high2 is the highest numbered column and low2 is the lowest numbered column—the upper and lower bounds for the second dimension of A. To simplify the exposition, let len2 = high2 − low2 + 1. Since rows are laid out consecutively, row i begins at (i − low1) × len2 × w from the start of A. This suggests the address computation:
@A + (i − low1) × len2 × w + (j − low2) × w
Substituting actual values in for i, j, low1, high2, low2, and w, we find that A[2,3] lies at offset
((2 − 1) × 4 + (3 − 1)) × 4 = 24
from the start of A. (@A points to the first element, A[1,1], at offset 0.) Looking at A in memory, we find that the element 24 bytes past @A is, in fact, A[2,3].
1,1   1,2   1,3   1,4   2,1   2,2   2,3   2,4
In the vector case, we were able to simplify the calculation when upper and lower bounds were known at compile time. Applying the same algebra to adjust the base address in the two-dimensional case produces
@A + (i × len2 × w) − (low1 × len2 × w) + (j × w) − (low2 × w), or
@A + (i × len2 × w) + (j × w) − (low1 × len2 × w + low2 × w)
The last term, (low1 × len2 × w + low2 × w), is independent of i and j, so it can be factored directly into the base address to create
@A0 = @A − (low1 × len2 × w + low2 × w)
This is the two-dimensional analog of the transformation that created a false zero for vectors in Section 8.5.1. Then, the array reference is simply
@A0 + i × len2 × w + j × w
Finally, we can re-factor to move the w outside, saving extraneous multiplies.
@A0 + (i × len2 + j) × w
This form of the polynomial leads to the following code sequence:
    load     @i         ⇒ ri     // i's value
    load     @j         ⇒ rj     // j's value
    loadI    @A0        ⇒ r@A    // adjusted base for A
    multI    ri, len2   ⇒ r1     // i × len2
    add      r1, rj     ⇒ r2     // + j
    lshiftI  r2, 2      ⇒ r3     // × 4
    loadAO   r@A, r3    ⇒ rv
In this form, we have reduced the computation to a pair of additions, one multiply, and one shift. Of course, some of i, j, and @A0 may be in registers.
If we do not know the array bounds at compile-time, we must either compute the adjusted base address at run-time, or use the more complex polynomial that includes the subtractions that adjust for lower bounds.
@A + ((i − low1) × len2 + j − low2) × w
In this form, the code to evaluate the addressing polynomial will require two additional subtractions.
To handle higher dimensional arrays, the compiler must generalize the address polynomial. In three dimensions, it becomes
base address + w ×
      (((index1 − low1) × len2 + index2 − low2) × len3 + index3 − low3)
Further generalization is straightforward.
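The generalization can be written down directly. The following C sketch evaluates the row-major polynomial for a d-dimensional array whose bounds are known only at run time; the arrays low[] and len[] stand in for whatever descriptor the compiler uses and are an assumption of this sketch:

    /* Byte offset of element idx[0..d-1] from @A, row-major layout. */
    long row_major_offset(int d, const long idx[], const long low[],
                          const long len[], long w) {
        long off = idx[0] - low[0];
        for (int k = 1; k < d; k++)
            off = off * len[k] + (idx[k] - low[k]);  /* peel one dimension per step */
        return off * w;
    }

For A[1..2,1..4] with w = 4, calling it with idx = {2,3} yields ((2 − 1) × 4 + (3 − 1)) × 4 = 24, the offset computed above.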
Column-major Order   Accessing an array stored in column-major order is similar to the case for row-major order. The difference in calculation arises from the difference in storage order. Where row-major order places entire rows in contiguous memory, column-major order places entire columns in contiguous memory. Thus, the address computation considers the individual dimensions in the opposite order. To access our example array, A[1..2,1..4], when it is stored in column-major order, the compiler must emit code that finds the starting address for column j and computes the vector-style offset within that column for element i. The start of column j occurs at offset (j − low2) × len1 × w from the start of A. Within the column, element i occurs at (i − low1) × w. This leads to an address computation of
@A + ((j − low2) × len1 + i − low1) × w
Substituting actual values for i, j, low1, low2, len1, and w, A[2,3] becomes
@A + ((3 − 1) × 2 + (2 − 1)) × 4 = @A + 20,
so that A[2,3] lies 20 bytes past the start of A. Looking at the memory layout from Section 8.5.2, we see that @A + 20 is, indeed, the address of A[2,3].
1,1   2,1   1,2   2,2   1,3   2,3   1,4   2,4
The same manipulations of the addressing polynomial that applied for row-major order work with column-major order. We can also adjust the base address to compensate for non-zero lower bounds. This leads to a computation of
@A0 + (j × len1 + i) × w
for the reference A[i,j] when bounds are known at compile time and
@A0 + ((j − low2) × len1 + i − low1) × w
when the bounds are not known.
For a three-dimensional array, this generalizes to
base address + w ×
      (((index3 − low3) × len2 + index2 − low2) × len1 + index1 − low1)
The address polynomials for higher dimensions generalize along the same lines as for row-major order.
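The column-major version of the earlier sketch simply consumes the dimensions in the opposite order; the same caveat about the assumed low[] and len[] descriptor applies:

    /* Byte offset of element idx[0..d-1] from @A, column-major layout. */
    long col_major_offset(int d, const long idx[], const long low[],
                          const long len[], long w) {
        long off = idx[d - 1] - low[d - 1];
        for (int k = d - 2; k >= 0; k--)
            off = off * len[k] + (idx[k] - low[k]);
        return off * w;
    }

For A[1..2,1..4] and idx = {2,3}, this yields ((3 − 1) × 2 + (2 − 1)) × 4 = 20, matching the computation above.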
Indirection Vectors   Using indirection vectors simplifies the code generated to access an individual element. Since the outermost dimension is stored as a set of vectors, the final step looks like the vector access described in Section 8.5.1. For B[i,j,k], the final step computes an offset from k, the outermost dimension's lower bound, and the length of an element for B. The preliminary steps derive the starting address for this vector by following the appropriate pointers through the indirection vector structure.
Thus, to access element B[i,j,k] in the array B shown in Figure 8.6, the compiler would use B, i, and the length of a pointer (4), to find the vector for the subarray B[i,*,*]. Next, it would use that result, along with j and the length of a pointer to find the vector for the subarray B[i,j,*]. Finally, it uses the vector address computation for index k, and element length w to find B[i,j,k] in this vector.
If the current values for i, j, and k exist in registers ri, rj, and rk, respectively, and @B0 is the zero-adjusted address of the first dimension, then B[i,j,k] can be referenced as follows.
    loadI    @B0       ⇒ r@B    // assume zero-adjusted pointers
    lshiftI  ri, 2     ⇒ r1     // pointer is 4 bytes
    loadAO   r@B, r1   ⇒ r2
    lshiftI  rj, 2     ⇒ r3     // pointer is 4 bytes
    loadAO   r2, r3    ⇒ r4
    lshiftI  rk, 2     ⇒ r5     // vector code from § 8.5.1
    loadAO   r4, r5    ⇒ r6
This code assumes that the pointers in the indirection structure have already been adjusted to account for non-zero lower bounds. If the pointers have not
been adjusted, then the values in rj and rk must be decremented by the corresponding lower bounds.
Using indirection vectors, the reference requires just two instructions per dimension. This property made the indirection vector implementation of arrays efficient on systems where memory access was fast relative to arithmetic—for example, on most computer systems prior to 1985. Several compilers used indirection vectors to manage the cost of address arithmetic. As the cost of memory accesses has increased relative to arithmetic, this scheme has lost its advantage. If systems again appear where memory latencies are small relative to arithmetic, indirection vectors may again emerge as a practical way to decrease access costs.³
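At the source level, the same cost structure is visible in the pointer-chain idiom sketched earlier: each subscript becomes one shift-and-add plus one load, and no multiplies appear. The declaration below is illustrative:

    double element(double ***B, int i, int j, int k) {
        return B[i][j][k];   /* B[i] is a load, B[i][j] another, and the
                                final subscript loads the element itself */
    }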
Accessing Array-valued Parameters   When an array is passed as a parameter, most implementations pass it by reference. Even in languages that use call-by-value for all other parameters, arrays are usually passed by reference. Consider the mechanism required to pass an array by value. The calling procedure would need to copy each array element value into the activation record of the called procedure. For all but the smallest arrays, this is impractical. Passing the array as a reference parameter can greatly reduce the cost of each call.
If the compiler is to generate array references in the called procedure, it needs information about the dimensions of the array bound to the parameter. In Fortran, for example, the programmer is required to declare the variable using either constants or other formal parameters to specify its dimensions. Thus, Fortran places the burden for passing information derived from the array’s original declaration to the called procedure. This lets each invocation of the procedure use the correct constants for the array that it is passed.
Other languages leave the task of collecting, organizing, and passing the necessary information to the compiler. This approach is necessary if the array's size cannot be statically determined—that is, it is allocated at run-time. Even when the size can be statically determined, this approach is useful because it abstracts away details that would otherwise clutter code. In these circumstances, the compiler builds a descriptor that contains both a pointer to the start of the array and the necessary information on each dimension. The descriptor has a known size, even when the array's size cannot be known at compile time. Thus, the compiler can allocate space for the descriptor in the activation record (ar) of the called procedure. The value passed in the array's parameter slot is a pointer to this descriptor. For reasons lost in antiquity, we call this descriptor a dope vector.
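One plausible shape for such a descriptor, sketched in C; the field names and the choice to store a false-zero base are illustrative, not a definition from the text:

    typedef struct {
        long low;        /* lower bound of this dimension             */
        long len;        /* extent of this dimension, high - low + 1  */
    } DimInfo;

    typedef struct {
        char   *base;      /* start of the array's storage (or its false zero) */
        long    elemSize;  /* w, the element length in bytes                    */
        int     rank;      /* number of dimensions                              */
        DimInfo dim[];     /* rank entries, one per dimension                   */
    } DopeVector;

A callee evaluates the usual addressing polynomial, but fetches low, len, and the base pointer from the descriptor instead of using compile-time constants.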
When the compiler generates a reference to an array that has been passed as a parameter, it must draw the information out of the dope vector. It generates the same address polynomial that it would use for a reference to a local array, loading values out of the dope vector as needed. The compiler must decide, as a matter of policy, which form of the addressing polynomial it will use. With the naive address polynomial, the dope vector must contain a pointer to the start of
³ On cache-based machines, locality is critical to performance. There is little reason to believe that indirection vectors have good locality. It seems more likely that this scheme generates a reference stream that appears random to the memory system.
