Cooper, K. Engineering a Compiler
11.5. MORE AGGRESSIVE TECHNIQUES

[Figure 11.11: two control-flow graphs, labeled "Original Code" and "After Cloning," showing blocks B1 and B2 between entry and exit nodes.]
Figure 11.11: Cloning a Tail-call
Cloning must be constrained to avoid excessive code growth and to avoid unwinding loops. A typical implementation clones blocks within an innermost loop, stopping when it reaches a loop-closing edge (sometimes called a back edge in the control-flow graph). This creates a situation in which the only multiple-entry block in the loop is the first block in the loop; all other paths contain only single-entry blocks.
A second example that merits consideration arises in tail-recursive programs. Recall, from Section 8.8.2, that a program is tail recursive if its last action is a recursive self-invocation. When the compiler detects a tail-call, it can convert the call into a branch back to the procedure’s entry. This implements the recursion with an ad hoc looping construct rather than a full-weight procedure call. From the scheduler’s point of view, however, cloning may improve the situation.
The left side of Figure 11.11 shows an abstracted control-flow graph for a tail-recursive routine, after the tail-call has been converted to iteration. Block B1 is entered along two paths, the path from the procedure entry and the path from B2. This forces the scheduler to use worst-case assumptions about what precedes B1. By cloning B1, as shown on the right, the compiler can create a situation in which control enters each copy of B1 along only one edge. This may improve the results of regional scheduling with either an ebb scheduler or a loop scheduler. To further simplify the situation, the compiler might coalesce the cloned B1 onto the end of B2, creating a single-block loop body. The resulting loop can be scheduled with either a local scheduler or a loop scheduler, as appropriate.
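As a concrete sketch of this transformation, the loop might look like the following in iloc. The labels, the test register r_done, and the elided operations (written ...) are invented for illustration; the important point is the number of entry edges into each block.

```
// Original: the tail-call has become a branch, so l_B1
// has two entry edges, one from entry and one from l_B2.
entry:  ...
        jumpI → l_B1
l_B1:   ...                        // loop header
        cbr   r_done → l_exit, l_B2
l_B2:   ...                        // recursive case
        jumpI → l_B1               // the converted tail-call

// After cloning: l_B2 branches to its own private copy of B1,
// so each copy of B1 now has exactly one entry edge.
entry:  ...
        jumpI → l_B1
l_B1:   ...
        cbr   r_done → l_exit, l_B2
l_B2:   ...
        jumpI → l_B1c
l_B1c:  ...                        // clone of B1; coalescing it onto
        cbr   r_done → l_exit, l_B2  // the end of l_B2 yields a
                                     // single-block loop body
```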
11.5.2 Global Scheduling
Of course, the compiler could attempt to take a global approach to scheduling, just as compilers take a global approach to register allocation. Global scheduling schemes are a cross between code motion, performed quite late, and regional scheduling to handle the details of instruction ordering. They require a global graph to represent precedence; it must account for the flow of data along all possible control-flow paths in the program.
These algorithms typically determine, for each operation, the earliest position in the control-flow graph that is consistent with the constraints represented in the precedence graph. Given that location, they may move the operation later in the control-flow graph, to the deepest location that is controlled by the same set of conditions. The former heuristic is intended to move operations out of deeply nested loops and into less frequently executed positions. Moving the operation earlier in the control-flow graph often increases, at least temporarily, the size of its live range. That leads to the latter heuristic, which tries to move the operation as close as possible to its subsequent use without moving it any deeper in the nesting of the control-flow graph.

Digression: Measuring Run-time Performance

The primary goal of instruction scheduling is to improve the running time of the generated code. Discussions of performance use many different metrics; the two most common are:

Operations per second  The metric commonly used to advertise computers and to compare system performance is the number of operations executed in a second. This can be measured as instructions issued per second or instructions retired per second.

Time to complete a fixed task  This metric uses one or more programs whose behavior is known, and compares the time required to complete these fixed tasks. This approach, called benchmarking, provides information about overall system performance, both hardware and software, on a particular workload.

No single metric contains enough information to allow evaluation of the quality of code generated by the compiler’s back end. For example, if the measure is operations per second, does the compiler get extra credit for leaving extraneous (but independent) operations in the code? The simple timing metric provides no information about what is achievable for the program. Thus, it allows one compiler to do better than another, but fails to show the distance between the generated code and what is optimal for that code on the target machine.

Numbers that the compiler writer might want to measure include the percentage of executed instructions whose output is actually used, and the percentage of cycles spent in stalls and interlocks. The former gives insight into some aspects of predicated execution, while the latter directly measures some aspects of schedule quality.
The compiler writer may be able to achieve similar results in a simpler way—using a specialized algorithm to perform code motion (such as Lazy Code Motion, described in Chapter 14) followed by a strong local or regional scheduler. The arguments for global optimization and for global allocation may not carry over to scheduling; the emphasis on scheduling is to avoid stalls, interlocks, and nops. These latter issues tend to be localized in their impact.
11.6 Summary and Perspective
Algorithms that guarantee optimal schedules exist for simplified situations. For example, on a machine with one functional unit and uniform operation latencies, the Sethi-Ullman labelling algorithm creates an optimal schedule for an expression tree [47]. It can be adapted to produce good code for expression dags. Fischer and Proebsting built on the labelling algorithm to derive an algorithm that produces optimal or near-optimal results for small memory latencies [43]. Unfortunately, it has trouble when either the number of functional units or their latencies rises.
In practice, modern computers have become complex enough that none of the simplified models adequately reflect their behavior. Thus, most compilers use some form of list scheduling. The algorithm is easily adapted and parameterized; it can be run for forward scheduling and backward scheduling. The technology of list scheduling is the base from which more complex schedulers, like software pipeliners, are built.
Techniques that operate over larger regions have grown up in response to real problems. Trace scheduling was developed for vliw architectures, where the compiler needed to keep many functional units busy. Techniques that schedule extended basic blocks and loops are, in essence, responses to the increase in both the number of pipelines that the compiler must consider and their individual latencies. As machines have become more complex, schedulers have needed a larger scheduling context to discover enough instruction-level parallelism to keep the machines busy.
The example for backward versus forward scheduling in Figure 11.4 was brought to our attention by Philip Schielke [46]. It is from the Spec benchmark program go. It captures, concisely, an effect that has caused many compiler writers to include both forward and backward schedulers in their back ends.
APPENDIX A. ILOC
operation      →  opcode operand-list ⇒ operand-list

operand-list   →  operand
               |  operand , operand-list

operand        →  register
               |  number
               |  label
Operands come in three types: register, number, and label. The type of each operand is determined by the opcode and the position of the operand in the operation. In the examples, we make this textually obvious by beginning all register operands with the letter r and all labels with a letter other than r (typically, with an l).
We assume that source operands are read at the beginning of the cycle when the operation issues and that target operands are defined at the end of the cycle in which the operation completes.
Most operations have a single target operand; some of the store operations have multiple target operands. For example, the storeAI operation has a single source operand and two target operands. The source must be a register, and the targets must be a register and an immediate constant. Thus, the iloc operation
storeAI ri ⇒ rj, 4
computes an address by adding 4 to the contents of rj and stores the value found in ri into the memory location specified by the address. In other words,
Memory(rj +4) ← Contents(ri)
The non-terminal opcode can be any of the iloc operation codes. Unfortunately, as in a real assembly language, the relationship between an opcode and the form of its arguments is less than systematic. The easiest way to specify the form of each opcode is in a tabular form. Figure A.2 at the end of this appendix shows the number of operands and their types for each iloc opcode used in the book.
As a lexical matter, iloc comments begin with the string // and continue until the end of a line. We assume that these are stripped out by the scanner; thus they can occur anywhere in an instruction and are not mentioned in the grammar.
To make this discussion more concrete, let’s work through the example used in Chapter 1. It is shown in Figure A.1. To start, notice the comments on the right edge of most lines. In our iloc-based systems, comments are automatically generated by the compiler’s front end, to make the iloc code more readable by humans. Since the examples in this book are intended primarily for humans, we continue this tradition of annotating the iloc code.
This example assumes that register rsp holds the address of a region in memory where the variables w, x, y, and z are stored at offsets 0, 8, 16, and 24, respectively.
loadAI   rsp, 0   ⇒ rw    // w is at offset 0 from rsp
loadI    2        ⇒ r2    // constant 2 into r2
loadAI   rsp, 8   ⇒ rx    // x is at offset 8
loadAI   rsp, 16  ⇒ ry    // y is at offset 16
loadAI   rsp, 24  ⇒ rz    // z is at offset 24
mult     rw, r2   ⇒ rw    // rw ← w × 2
mult     rw, rx   ⇒ rw    // rw ← (w × 2) × x
mult     rw, ry   ⇒ rw    // rw ← (w × 2 × x) × y
mult     rw, rz   ⇒ rw    // rw ← (w × 2 × x × y) × z
storeAI  rw       ⇒ rsp, 0   // write rw back to ’w’
Figure A.1: Introductory example, revisited
The first instruction is a loadAI operation, or a load address-immediate. From the opcode table, we can see that it combines the contents of rsp with the immediate constant 0 and retrieves the value found in memory at that address. We know, from above, that this value is w. It stores the retrieved value in rw . The next instruction is a loadI operation, or a load immediate. It moves the value 2 directly into r2. (Effectively, it reads a constant out of the instruction stream and into a register.) Instructions three through five load the values of x into rx, y into ry , and z into rz .
The sixth instruction multiplies the contents of rw and r2, storing the result back into rw. Instruction seven multiplies this quantity by rx. Instruction eight multiplies in ry , and instruction nine picks up rz . In each instruction from six through nine, the value is accumulated into rw .
Finally, instruction ten saves the value to memory. It uses a storeAI, or store address-immediate, to write the contents of rw into the memory location at offset 0 from rsp. As pointed out in Chapter 1, this sequence evaluates the expression
w ← w × 2 × x × y × z
The opcode table, at the end of this appendix, lists all of the opcodes used in iloc examples in the book.
Comparison and Conditional Branch  In general, the iloc comparison operators take two values and return a boolean value. The operations cmp_LT, cmp_LE, cmp_EQ, cmp_NE, cmp_GE, and cmp_GT work this way. The corresponding conditional branch, cbr, takes a boolean as its argument and transfers control to one of two target labels. The first label is selected if the boolean is true; the second label is selected if the boolean is false.
Using two labels on the conditional branch has two advantages. First, the code is somewhat more concise. In several situations, a conditional branch might be followed by an absolute branch. The two-label branch lets us record that combination in a single operation. Second, the code is easier to manipulate. A single-label conditional branch implies some positional relationship with the next instruction; there is an implicit connection between the branch and its
“fall-through” path. The compiler must take care, particularly when reading and writing linear code, to preserve these relationships. The two-label conditional branch makes this implicit connection explicit, and removes any possible positional dependence.
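A small example shows the two-label form in use; the labels here are invented for illustration:

```
        cmp_LT r1, r2 ⇒ r3          // r3 ← (r1 < r2)
        cbr    r3 → l_then, l_else  // both targets explicit;
                                    // no implicit fall-through path
l_then: ...
        jumpI  → l_merge
l_else: ...
        jumpI  → l_merge
l_merge: ...
```

Because both targets are named, either block can be moved or the branch reordered without breaking a positional assumption.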
Because the two branch targets are not “defined” by the instruction, we change the syntax slightly. Rather than use the ⇒ arrow, we write branches with the smaller → arrow.
In a few places, we want to discuss what happens when the comparison writes a complex value into a designated area, a “condition code.” The condition code is a multi-bit value that can be interpreted only by a more complex conditional branch instruction. To talk about this mechanism, we use an alternate set of comparison and conditional branch operators. The comparison operator, comp, takes two values and sets the condition code appropriately. We always designate the target of comp as a condition-code register by writing it cci. The corresponding conditional branch has six variants, one for each comparison result. Figure A.3 shows these instructions and their meanings.
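In this alternate scheme, the test from the earlier example might be written as follows. The labels are again invented, and cbr_LT stands for the less-than variant of the condition-code branch from Figure A.3:

```
        comp   r1, r2 ⇒ cc1          // cc1 encodes <, =, or >
        cbr_LT cc1 → l_then, l_else  // taken if r1 < r2
```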
Note: In a real compiler that used iloc, we would need to introduce some representation for distinct data types. The research compiler that we built using iloc had several distinct data types—integer, single-precision floating-point, double-precision floating-point, complex, and pointer.
A.0.1 Naming Conventions
1. Memory offsets for variables are represented symbolically by prefixing the variable name with the @ character.
2. The user can assume an unlimited supply of registers. These are named with simple integers, as in r1776.
3. The register r0 is reserved as a pointer to the current activation record. Thus, the operation loadAI r0, @x ⇒ r1 implicitly exposes the fact that the variable x is stored in the activation record of the procedure containing the operation.
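Taken together, these conventions make short examples self-describing. For instance, a fragment that increments the local variable x might read (the registers chosen are arbitrary):

```
loadAI  r0, @x ⇒ r1      // r1 ← x, from the current activation record
addI    r1, 1  ⇒ r2      // r2 ← x + 1
storeAI r2     ⇒ r0, @x  // x ← x + 1
```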
A.0.2 Other Important Points
An iloc operation reads its operands at the start of the cycle in which it issues. It writes its target at the end of the cycle in which it finishes. Thus, two operations in the same instruction can both refer to a given register. Any uses receive the value defined at the start of the cycle. Any definitions occur at the end of the cycle. This is particularly important in Figure 11.9.
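For example, on a machine that issues two operations per instruction, a single instruction can exchange two registers, because each copy reads the value that its source held at the start of the cycle. Here i2i is the iloc register-to-register copy, and the brackets group the two operations of one instruction (the grouping notation is illustrative, not part of the iloc grammar):

```
[ i2i r1 ⇒ r2 ; i2i r2 ⇒ r1 ]   // after this cycle, r1 and r2 are swapped
```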
