
11.5. MORE AGGRESSIVE TECHNIQUES

[Figure 11.11 appeared here: two abstracted control-flow graphs, labelled "Original Code" and "After Cloning", showing blocks B1 and B2 with entry and exit edges. Caption: "Figure 11.11: Cloning a Tail-call".]

to avoid excessive code growth and to avoid unwinding loops. A typical implementation might clone blocks within an innermost loop, stopping when it reaches a loop-closing edge (sometimes called a back-edge in the control-flow graph). This creates a situation where the only multiple-entry block in the loop is the first block in the loop. All other paths have only single-entry blocks.
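As a concrete sketch of that cloning step, the fragment below splits every multi-entry block inside a single innermost loop, stopping at the loop-closing edge. The CFG encoding, the function name, and the assumption that the only cycle runs through the loop header are ours, not the book's:

```python
from collections import defaultdict

def clone_loop_blocks(edges, header):
    """Clone blocks inside one innermost loop so that every block except
    `header` has exactly one predecessor.  Cloning stops at edges that
    return to `header` (the loop-closing edges).  Assumes the only cycle
    in `edges` runs through `header`."""
    succs = defaultdict(list)
    preds = defaultdict(list)
    for s, d in edges:
        succs[s].append(d)
        preds[d].append(s)
    new_edges = []
    counter = defaultdict(int)
    worklist = [(header, header)]      # (name in new graph, original block)
    while worklist:
        new_name, old_name = worklist.pop()
        for d in succs[old_name]:
            if d == header:            # loop-closing edge: do not follow
                new_edges.append((new_name, header))
            elif len(preds[d]) == 1:   # already single-entry: keep it
                new_edges.append((new_name, d))
                worklist.append((d, d))
            else:                      # multi-entry: one clone per path
                counter[d] += 1
                clone = f"{d}.{counter[d]}"
                new_edges.append((new_name, clone))
                worklist.append((clone, d))
    return new_edges
```

After cloning, only the header remains a multi-entry block, so an ebb scheduler sees single-entry paths everywhere else in the loop.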

A second example that merits consideration arises in tail-recursive programs. Recall, from Section 8.8.2, that a program is tail recursive if its last action is a recursive self-invocation. When the compiler detects a tail-call, it can convert the call into a branch back to the procedure’s entry. This implements the recursion with an ad hoc looping construct rather than a full-weight procedure call. From the scheduler’s point of view, however, cloning may improve the situation.

The left side of Figure 11.11 shows an abstracted control-flow graph for a tail-recursive routine, after the tail-call has been converted to iteration. Block B1 is entered along two paths, the path from the procedure entry and the path from B2. This forces the scheduler to use worst-case assumptions about what precedes B1. By cloning B1, as shown on the right, the compiler can create the situation where control enters B1 along only one edge. This may improve the results of regional scheduling, with an ebb scheduler or a loop scheduler. To further simplify the situation, the compiler might coalesce B1 onto the end of B2, creating a single block loop body. The resulting loop can be scheduled with either a local scheduler or a loop scheduler, as appropriate.
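A source-level analogue of the tail-call conversion, shown purely for illustration (the compiler performs it on the intermediate representation, not on source), rewrites the recursive call as a branch back to the entry:

```python
def sum_tail(n, acc=0):
    """Tail-recursive: the last action is the self-invocation."""
    if n == 0:
        return acc
    return sum_tail(n - 1, acc + n)

def sum_loop(n, acc=0):
    """What the compiler derives: the entry becomes a loop header and
    the tail-call becomes a re-binding of the parameters plus a branch
    back to that header."""
    while True:
        if n == 0:
            return acc
        n, acc = n - 1, acc + n
```

Both compute the same value; the loop form avoids the full-weight procedure call on every iteration.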

11.5.2 Global Scheduling

Of course, the compiler could attempt to take a global approach to scheduling, just as compilers take a global approach to register allocation. Global scheduling schemes are a cross between code motion, performed quite late, and regional scheduling to handle the details of instruction ordering. They require a global graph to represent precedence; it must account for the flow of data along all possible control-flow paths in the program.

These algorithms typically determine, for each operation, the earliest position in the control-flow graph that is consistent with the constraints represented in the precedence graph. Given that location, they may move the operation later in the control-flow graph to the deepest location that is controlled by the same


CHAPTER 11. INSTRUCTION SCHEDULING

Digression: Measuring Run-time Performance

The primary goal of instruction scheduling is to improve the running time of the generated code. Discussions of performance use many different metrics; the two most common are:

Operations per second The metric commonly used to advertise computers and to compare system performance is the number of operations executed in a second. This can be measured as instructions issued per second or instructions retired per second.

Time to complete a fixed task This metric uses one or more programs whose behavior is known, and compares the time required to complete these fixed tasks. This approach, called benchmarking, provides information about overall system performance, both hardware and software, on a particular workload.

No single metric contains enough information to allow evaluation of the quality of code generated by the compiler’s back end. For example, if the measure is operations per second, does the compiler get extra credit for leaving extraneous (but independent) operations in code? The simple timing metric provides no information about what is achievable for the program. Thus, it allows one compiler to do better than another, but fails to show the distance between the generated code and what is optimal for that code on the target machine.

Numbers that the compiler writer might want to measure include the percentage of executed instructions whose output is actually used, and the percentage of cycles spent in stalls and interlocks. The former gives insight into some aspects of predicated execution, while the latter directly measures some aspects of schedule quality.

set of conditions. The former heuristic is intended to move operations out of deeply nested loops and into less frequently executed positions. Moving the operation earlier in the control-flow graph often increases, at least temporarily, the size of its live range. That leads to the latter heuristic, which tries to move the operation as close as possible to its subsequent use without moving it any deeper in the nesting of the control-flow graph.

The compiler writer may be able to achieve similar results in a simpler way—using a specialized algorithm to perform code motion (such as Lazy Code Motion, described in Chapter 14) followed by a strong local or regional scheduler. The arguments for global optimization and for global allocation may not carry over to scheduling; the emphasis in scheduling is on avoiding stalls, interlocks, and nops. These latter issues tend to be localized in their impact.


11.6 Summary and Perspective

Algorithms that guarantee optimal schedules exist for simplified situations. For example, on a machine with one functional unit and uniform operation latencies, the Sethi-Ullman labelling algorithm creates an optimal schedule for an expression tree [47]. It can be adapted to produce good code for expression dags. Fischer and Proebsting built on the labelling algorithm to derive an algorithm that produces optimal or near optimal results for small memory latencies [43]. Unfortunately, it has trouble when either the number of functional units or their latencies rises.
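One common formulation of the labelling computation, for a binary expression tree with uniform latencies, can be sketched as follows; the tree encoding is ours, and this shows only the register-demand labels, not the code emission:

```python
def label(node):
    """Minimum number of registers needed to evaluate `node`, which is
    either a leaf (a string) or a tuple (op, left, right)."""
    if isinstance(node, str):
        return 1                    # a leaf value occupies one register
    _, left, right = node
    l, r = label(left), label(right)
    # Unequal labels: evaluate the larger side first and reuse its spare
    # registers for the smaller side.  Equal labels: one side's result
    # must be held while the other is evaluated, so one extra register
    # is needed.
    return max(l, r) if l != r else l + 1
```

Evaluating children in decreasing label order then yields a schedule that uses the minimal number of registers on the idealized machine.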

In practice, modern computers have become complex enough that none of the simplified models adequately reflect their behavior. Thus, most compilers use some form of list scheduling. The algorithm is easily adapted and parameterized; it can be run for forward scheduling and backward scheduling. The technology of list scheduling is the base from which more complex schedulers, like software pipeliners, are built.

Techniques that operate over larger regions have grown up in response to real problems. Trace scheduling was developed for vliw architectures, where the compiler needed to keep many functional units busy. Techniques that schedule extended basic blocks and loops are, in essence, responses to the increase in both the number of pipelines that the compiler must consider and their individual latencies. As machines have become more complex, schedulers have needed a larger scheduling context to discover enough instruction-level parallelism to keep the machines busy.

The example for backward versus forward scheduling in Figure 11.4 was brought to our attention by Philip Schielke [46]. It is from the Spec benchmark program go. It captures, concisely, an effect that has caused many compiler writers to include both forward and backward schedulers in their back ends.


Appendix A

ILOC

Introduction

Iloc is the linear assembly code for a simple abstract machine. The iloc abstract machine is a risc-like architecture. We have assumed, for simplicity, that all operations work on the same type of data—sixty-four bit integer data.

It has an unlimited number of registers. It has a couple of simple addressing modes, load and store operations, and three-address register-to-register operators.

An iloc program consists of a sequential list of instructions. An instruction may have a label; a label is followed immediately by a colon. If more than one label is needed, we represent it in writing by adding the special instruction nop that performs no action. Formally:

iloc-program → instruction-list

instruction-list → instruction
                 | label : instruction
                 | instruction instruction-list

Each instruction contains one or more operations. A single-operation instruction is written on a line of its own, while a multi-operation instruction can span several lines. To group operations into a single instruction, we enclose them in square brackets and separate them with semi-colons. More formally:

instruction → operation
            | [ operation-list ]

operation-list → operation
               | operation ; operation-list

An iloc operation corresponds to an instruction that might be issued to a single functional unit in a single cycle. It has an optional label, an opcode, a set of source operands, and a set of target operands. The sources are separated from the targets by the symbol ⇒, pronounced “into.”


operation → opcode operand-list ⇒ operand-list

operand-list → operand
             | operand , operand-list

operand → register
        | number
        | label

Operands come in three types: register, number, and label. The type of each operand is determined by the opcode and the position of the operand in the operation. In the examples, we make this textually obvious by beginning all register operands with the letter r and all labels with a letter other than r (typically, with an l).
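Under these conventions, a small checker for a single operation might look like the sketch below. The opcode table is abbreviated to three illustrative entries, and we write the arrow as "=>" in plain text; both choices are ours, not part of iloc:

```python
OPERAND_KINDS = {                 # (source kinds, target kinds); abbreviated
    "add":     (["reg", "reg"], ["reg"]),
    "loadAI":  (["reg", "num"], ["reg"]),
    "storeAI": (["reg"], ["reg", "num"]),
}

def kind(tok):
    """Registers begin with 'r', numbers are digits, anything else is a
    label -- the textual convention described above."""
    if tok.lstrip("-").isdigit():
        return "num"
    return "reg" if tok.startswith("r") else "lab"

def parse_operation(text):
    """Parse 'opcode sources => targets' into (opcode, sources, targets),
    checking operand kinds against the opcode's table entry."""
    head, _, tail = text.partition("=>")
    opcode, *sources = head.replace(",", " ").split()
    targets = tail.replace(",", " ").split()
    want_src, want_tgt = OPERAND_KINDS[opcode]
    if [kind(t) for t in sources] != want_src or \
       [kind(t) for t in targets] != want_tgt:
        raise ValueError(f"bad operands in {text!r}")
    return opcode, sources, targets
```

As the text notes, the operand positions, not the tokens alone, determine what the opcode expects; the table drives the check.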

We assume that source operands are read at the beginning of the cycle when the operation issues and that target operands are defined at the end of the cycle in which the operation completes.

Most operations have a single target operand; some of the store operations have multiple target operands. For example, the storeAI operation has a single source operand and two target operands. The source must be a register, and the targets must be a register and an immediate constant. Thus, the iloc operation

storeAI ri ⇒ rj, 4

computes an address by adding 4 to the contents of rj and stores the value found in ri into the memory location specified by the address. In other words,

Memory(rj + 4) ← Contents(ri)

The non-terminal opcode can be any of the iloc operation codes. Unfortunately, as in a real assembly language, the relationship between an opcode and the form of its arguments is less than systematic. The easiest way to specify the form of each opcode is in a tabular form. Figure A.2 at the end of this appendix shows the number of operands and their types for each iloc opcode used in the book.

As a lexical matter, iloc comments begin with the string // and continue until the end of a line. We assume that these are stripped out by the scanner; thus they can occur anywhere in an instruction and are not mentioned in the grammar.

To make this discussion more concrete, let’s work through the example used in Chapter 1. It is shown in Figure A.1. To start, notice the comments on the right edge of most lines. In our iloc-based systems, comments are automatically generated by the compiler’s front end, to make the iloc code more readable by humans. Since the examples in this book are intended primarily for humans, we continue this tradition of annotating the iloc code.

This example assumes that register rsp holds the address of a region in memory where the variables w, x, y, and z are stored at offsets 0, 8, 16, and 24, respectively.


loadAI   rsp, 0   ⇒ rw      // w is at offset 0 from rsp
loadI    2        ⇒ r2      // constant 2 into r2
loadAI   rsp, 8   ⇒ rx      // x is at offset 8
loadAI   rsp, 16  ⇒ ry      // y is at offset 16
loadAI   rsp, 24  ⇒ rz      // z is at offset 24
mult     rw, r2   ⇒ rw      // rw ← w × 2
mult     rw, rx   ⇒ rw      // rw ← (w × 2) × x
mult     rw, ry   ⇒ rw      // rw ← (w × 2 × x) × y
mult     rw, rz   ⇒ rw      // rw ← (w × 2 × x × y) × z
storeAI  rw       ⇒ rsp, 0  // write rw back to ’w’

Figure A.1: Introductory example, revisited

The first instruction is a loadAI operation, or a load address-immediate. From the opcode table, we can see that it combines the contents of rsp with the immediate constant 0 and retrieves the value found in memory at that address. We know, from above, that this value is w. It stores the retrieved value in rw. The next instruction is a loadI operation, or a load immediate. It moves the value 2 directly into r2. (Effectively, it reads a constant out of the instruction stream and into a register.) Instructions three through five load the values of x into rx, y into ry, and z into rz.

The sixth instruction multiplies the contents of rw and r2, storing the result back into rw. Instruction seven multiplies this quantity by rx. Instruction eight multiplies in ry , and instruction nine picks up rz . In each instruction from six through nine, the value is accumulated into rw .

Finally, instruction ten saves the value to memory. It uses a storeAI, or store address-immediate, to write the contents of rw into the memory location at offset 0 from rsp. As pointed out in Chapter 1, this sequence evaluates the expression

w ← w × 2 × x × y × z
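To check the walkthrough, here is a toy evaluator for just the four opcodes the example uses; the tuple encoding of operations, the dictionary-based memory, and the sample values for w, x, y, and z are ours:

```python
def run(ops, mem, reg):
    """Execute a list of iloc-like operations over register and memory
    dictionaries.  Supports only the opcodes in Figure A.1."""
    for op in ops:
        name = op[0]
        if name == "loadAI":            # Memory(r1 + c1) => r2
            _, r1, c, r2 = op
            reg[r2] = mem[reg[r1] + c]
        elif name == "loadI":           # c1 => r2
            _, c, r2 = op
            reg[r2] = c
        elif name == "mult":            # r1 * r2 => r3
            _, r1, r2, r3 = op
            reg[r3] = reg[r1] * reg[r2]
        elif name == "storeAI":         # r1 => Memory(r2 + c1)
            _, r1, r2, c = op
            mem[reg[r2] + c] = reg[r1]

# w, x, y, z live at offsets 0, 8, 16, 24 from rsp (address 0 here);
# the initial values 3, 5, 7, 2 are made up for the demonstration
mem = {0: 3, 8: 5, 16: 7, 24: 2}
prog = [
    ("loadAI", "rsp", 0, "rw"),  ("loadI", 2, "r2"),
    ("loadAI", "rsp", 8, "rx"),  ("loadAI", "rsp", 16, "ry"),
    ("loadAI", "rsp", 24, "rz"),
    ("mult", "rw", "r2", "rw"),  ("mult", "rw", "rx", "rw"),
    ("mult", "rw", "ry", "rw"),  ("mult", "rw", "rz", "rw"),
    ("storeAI", "rw", "rsp", 0),
]
run(prog, mem, {"rsp": 0})
```

After the run, the memory word for w holds w × 2 × x × y × z, exactly as the expression above says.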

The opcode table, at the end of this appendix, lists all of the opcodes used in iloc examples in the book.

Comparison and Conditional Branch In general, the iloc comparison operators take two values and return a boolean value. The operations cmp_LT, cmp_LE, cmp_EQ, cmp_NE, cmp_GE, and cmp_GT work this way. The corresponding conditional branch, cbr, takes a boolean as its argument and transfers control to one of two target labels. The first label is selected if the boolean is true; the second label is selected if the boolean is false.

Using two labels on the conditional branch has two advantages. First, the code is somewhat more concise. In several situations, a conditional branch might be followed by an absolute branch. The two-label branch lets us record that combination in a single operation. Second, the code is easier to manipulate. A single-label conditional branch implies some positional relationship with the next instruction; there is an implicit connection between the branch and its “fall-through” path. The compiler must take care, particularly when reading and writing linear code, to preserve these relationships. The two-label conditional branch makes this implicit connection explicit, and removes any possible positional dependence.
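A minimal executable rendering of the two-label semantics, with names and encoding of our own choosing, makes the absence of a fall-through path concrete:

```python
def cmp_LT(reg, r1, r2, r3):
    """r1 < r2 => true into r3 (otherwise false into r3)."""
    reg[r3] = reg[r1] < reg[r2]

def cbr(reg, r1, l1, l2):
    """Return the label control transfers to.  Both successors are
    named explicitly; there is no implicit fall-through."""
    return l1 if reg[r1] else l2

reg = {"ra": 1, "rb": 2}
cmp_LT(reg, "ra", "rb", "rc")       # 1 < 2, so rc holds true
target = cbr(reg, "rc", "L1", "L2") # selects the first label
```

Because cbr names both targets, reordering the surrounding blocks cannot silently change which path the false case takes.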

Because the two branch targets are not “defined” by the instruction, we change the syntax slightly. Rather than use the arrow ⇒, we write branches with the smaller arrow, →.

In a few places, we want to discuss what happens when the comparison returns a complex value written into a designated area for a “condition code.” The condition code is a multi-bit value that can only be interpreted with a more complex conditional branch instruction. To talk about this mechanism, we use an alternate set of comparison and conditional branch operators. The comparison operator, comp, takes two values and sets the condition code appropriately. We always designate the target of comp as a condition code register by writing it cci. The corresponding conditional branch has six variants, one for each comparison result. Figure A.3 shows these instructions and their meaning.

Note: In a real compiler that used iloc, we would need to introduce some representation for distinct data types. The research compiler that we built using iloc had several distinct data types—integer, single-precision floating-point, double-precision floating-point, complex, and pointer.

A.0.1 Naming Conventions

1. Memory offsets for variables are represented symbolically by prefixing the variable name with the @ character.

2. The user can assume an unlimited supply of registers. These are named with simple integers, as in r1776.

3. The register r0 is reserved as a pointer to the current activation record. Thus, the operation loadAI r0, @x ⇒ r1 implicitly exposes the fact that the variable x is stored in the activation record of the procedure containing the operation.

A.0.2 Other Important Points

An iloc operation reads its operands at the start of the cycle in which it issues. It writes its target at the end of the cycle in which it finishes. Thus, two operations in the same instruction can both refer to a given register. Any uses receive the value defined at the start of the cycle. Any definitions occur at the end of the cycle. This is particularly important in Figure 11.9.
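The consequence of this timing rule can be demonstrated directly: two copy operations in the same instruction can exchange a pair of registers, because both reads happen at the start of the cycle, before either write. The tuple encoding below is ours:

```python
def step(reg, instruction):
    """Execute one multi-operation instruction of ('i2i', src, dst)
    copies: every source is read at the start of the cycle, and every
    target is written at the end."""
    values = [reg[src] for _, src, _ in instruction]   # all reads first
    for (_, _, dst), v in zip(instruction, values):    # then all writes
        reg[dst] = v

reg = {"r1": 1, "r2": 2}
# [ i2i r1 => r2 ; i2i r2 => r1 ] swaps the two registers
step(reg, [("i2i", "r1", "r2"), ("i2i", "r2", "r1")])
```

Under sequential semantics the same pair of copies would lose a value; the read-at-issue, write-at-completion rule is what makes the swap work.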


Opcode     Sources   Targets   Meaning

add        r1, r2    r3        r1 + r2 ⇒ r3
sub        r1, r2    r3        r1 − r2 ⇒ r3
mult       r1, r2    r3        r1 × r2 ⇒ r3
div        r1, r2    r3        r1 ÷ r2 ⇒ r3
addI       r1, c1    r2        r1 + c1 ⇒ r2
subI       r1, c1    r2        r1 − c1 ⇒ r2
multI      r1, c1    r2        r1 × c1 ⇒ r2
divI       r1, c1    r2        r1 ÷ c1 ⇒ r2
lshift     r1, r2    r3        r1 ≪ r2 ⇒ r3
lshiftI    r1, c2    r3        r1 ≪ c2 ⇒ r3
rshift     r1, r2    r3        r1 ≫ r2 ⇒ r3
rshiftI    r1, c2    r3        r1 ≫ c2 ⇒ r3
loadI      c1        r2        c1 ⇒ r2
load       r1        r2        Memory(r1) ⇒ r2
loadAI     r1, c1    r2        Memory(r1 + c1) ⇒ r2
loadAO     r1, r2    r3        Memory(r1 + r2) ⇒ r3
cload      r1        r2        character load
cloadAI    r1, r2    r3        character loadAI
cloadAO    r1, r2    r3        character loadAO
store      r1        r2        r1 ⇒ Memory(r2)
storeAI    r1        r2, c1    r1 ⇒ Memory(r2 + c1)
storeAO    r1        r2, r3    r1 ⇒ Memory(r2 + r3)
cstore     r1        r2        character store
cstoreAI   r1        r2, r3    character storeAI
cstoreAO   r1        r2, r3    character storeAO
br                   l1        l1 → pc
cbr        r1        l1, l2    r1 = true  ⇒ l1 → pc
                               r1 = false ⇒ l2 → pc
cmp_LT     r1, r2    r3        r1 < r2 ⇒ true ⇒ r3
                               (otherwise, false ⇒ r3)
cmp_LE     r1, r2    r3        r1 ≤ r2 ⇒ true ⇒ r3
cmp_EQ     r1, r2    r3        r1 = r2 ⇒ true ⇒ r3
cmp_NE     r1, r2    r3        r1 ≠ r2 ⇒ true ⇒ r3
cmp_GE     r1, r2    r3        r1 ≥ r2 ⇒ true ⇒ r3
cmp_GT     r1, r2    r3        r1 > r2 ⇒ true ⇒ r3
i2i        r1        r2        r1 ⇒ r2
c2c        r1        r2        r1 ⇒ r2
c2i        r1        r2        converts character to integer
i2c        r1        r2        converts integer to character

Figure A.2: Iloc opcode table


Opcode     Sources   Targets   Meaning

comp       r1, r2    cc1       defines cc1
cbr_LT     cc1       l1, l2    cc1 = LT ⇒ l1 → pc
                               (otherwise l2 → pc)
cbr_LE     cc1       l1, l2    cc1 = LE ⇒ l1 → pc
cbr_EQ     cc1       l1, l2    cc1 = EQ ⇒ l1 → pc
cbr_GE     cc1       l1, l2    cc1 = GE ⇒ l1 → pc
cbr_GT     cc1       l1, l2    cc1 = GT ⇒ l1 → pc
cbr_NE     cc1       l1, l2    cc1 = NE ⇒ l1 → pc

Figure A.3: Alternate Compare/Branch Syntax
