
Chapter 7 Itanium Architecture

 

 

 

 

 

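Each instruction is 41 bits wide. Fields are listed left to right, with the field width in bits given in parentheses.

(a) Register format
Major opcode (4) | Opcode extension (10) | r3 (7) | r2 (7) | r1 (7) | qp (6)

(b) Immediate formats
8-bit immediate format:
Major opcode (4) | S (1) | Opcode extension (9) | r3 (7) | imm7 (7) | r1 (7) | qp (6)
14-bit immediate format:
Major opcode (4) | S (1) | Opcode extension (3) | imm6 (6) | r3 (7) | imm7 (7) | r1 (7) | qp (6)
22-bit immediate format:
Major opcode (4) | S (1) | imm9 (9) | imm5 (5) | r3 (2) | imm7 (7) | r1 (7) | qp (6)

(c) Compare formats
Basic compare format:
Major opcode (4) | Opcode extension (4) | p2 (6) | r3 (7) | r2 (7) | c (1) | p1 (6) | qp (6)
Compare immediate format:
Major opcode (4) | S (1) | Opcode extension (3) | p2 (6) | r3 (7) | imm7 (7) | c (1) | p1 (6) | qp (6)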

Figure 7.2 Sample Itanium instruction formats (continued on the next page).

The Itanium supports three immediate formats, which specify 8-, 14-, or 22-bit immediate values. In each immediate format, the sign of the immediate value is always placed in the fifth leftmost bit (the S bit in Figure 7.2b). When a 22-bit immediate value is specified, the source register r3 must be one of the first four registers (r0 through r3), as only two bits are available to specify this register.


 

 

 

 

 

 

Guide to RISC Processors

 

 

 

 

 

 

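Each instruction is 41 bits wide. Fields are listed left to right, with the field width in bits given in parentheses.

(d) Branch and call formats
IP-relative branch format:
Major opcode (4) | S (1) | d (1) | wh (2) | imm20 (20) | p (1) | unlabeled (3) | btype (3) | qp (6)
IP-relative call format:
Major opcode (4) | S (1) | d (1) | wh (2) | imm20 (20) | p (1) | unlabeled (3) | b1 (3) | qp (6)

(e) Integer load formats
Register indirect format:
Major opcode (4) | Opcode extension (7) | hint (2) | x (1) | r3 (7) | unlabeled (7) | r1 (7) | qp (6)
Register indirect with index format:
Major opcode (4) | Opcode extension (7) | hint (2) | x (1) | r3 (7) | r2 (7) | r1 (7) | qp (6)
Register indirect with immediate format:
Major opcode (4) | S (1) | Opcode extension (6) | hint (2) | imm1 (1) | r3 (7) | imm7 (7) | r1 (7) | qp (6)

Figure 7.2 Continued.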

Two sample compare instruction formats are presented in Figure 7.2c. As we show later, compare instructions can specify two predicate registers p1 and p2. In a typical compare operation, the two source operands are compared, and the result is placed in p1 and its complement in p2. The c bit extends the opcode to specify the complement compare relation. For example, if the encoding with c = 0 represents a “less than” comparison, c = 1 converts this to a “greater than or equal to” type of comparison.

Figure 7.2d shows a sample format for branch and call instructions. Each instruction takes a 21-bit signed IP-relative displacement to the target. Note that the Itanium also supports indirect branch and call instructions. For both instructions, d and wh fields are opcode extensions to specify hints. The d bit specifies the branch cache deallocation hint, and the wh field gives the branch whether hint. These two hints are discussed later.


In the branch instruction, the btype field indicates the type of branch instruction (e.g., conditional or counted-loop branch). In the call instruction, the b1 field specifies the branch register that should receive the return address.

The integer load instruction formats are shown in Figure 7.2e. The x bit is used for opcode extension. In the register indirect with immediate format, the 9-bit immediate value is split into three pieces: S, imm1, and imm7. The hint field, which gives the memory reference hint, is discussed later in this chapter.
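As a rough illustration of how the three pieces could be recombined, the Python sketch below rebuilds a signed 9-bit value from S, imm1, and imm7. The bit order (S as the most significant bit, then imm1, then imm7) and the function name assemble_imm9 are assumptions made for this example; the text only states that the value is split into these three fields.

```python
def assemble_imm9(s: int, imm1: int, imm7: int) -> int:
    """Rebuild a signed 9-bit immediate from its three fields.

    Assumed layout (not stated in the text): S is bit 8, imm1 is bit 7,
    and imm7 supplies bits 6..0.
    """
    if s not in (0, 1) or imm1 not in (0, 1) or not 0 <= imm7 < 128:
        raise ValueError("field out of range")
    raw = (s << 8) | (imm1 << 7) | imm7   # 9-bit two's-complement pattern
    return raw - 512 if s else raw        # apply the sign carried by S
```

Under this assumed layout, the representable range is -256 through 255, which matches a 9-bit signed value.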

Instruction-Level Parallelism

The Itanium enables instruction-level parallelism by letting the compiler/assembler explicitly indicate parallelism, by providing run-time support to execute instructions in parallel, and by providing a large number of registers to avoid register contention. First we discuss instruction groups, and then see how the hardware facilitates parallel execution of instructions by bundling nonconflicting instructions together.

Itanium instructions are bound into instruction groups. An instruction group is a set of instructions that do not have conflicting dependencies among them (read-after-write or write-after-write dependencies, as discussed later on page 115), and may execute in parallel. The compiler or assembler can indicate instruction groups by using the ;; notation. Let us look at a simple example to get an idea. Consider evaluating a logical expression consisting of four terms. For simplicity, assume that the results of these four logical terms are in registers r10, r11, r12, and r13. Then the logical expression in

if (r10 || r11 || r12 || r13) {
    /* if-block code */
}

can be evaluated using or-tree reduction as

or r1 = r10,r11        /* Group 1 */
or r2 = r12,r13;;
or r3 = r1,r2;;        /* Group 2 */
other instructions     /* Group 3 */

The first group performs two parallel or operations. Once these results are available, we can compute the final value of the logical expression. This final value in r3 can be used by other instructions to test the condition. Inasmuch as we have not discussed Itanium instructions, it does not make sense to explain these instructions at this point. We have some examples in a later section.

In any given clock cycle, the processor executes as many instructions from one instruction group as it can, according to its resources. An instruction group must contain at least one instruction; the number of instructions in an instruction group is not limited. Instruction groups are indicated in the code by cycle breaks (;;). An instruction group may also end dynamically during run-time by a taken branch.
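The dependency rule for instruction groups can be checked mechanically. The Python sketch below is only an illustration, not how a real compiler schedules code: it assumes a hypothetical representation of each instruction as a (destination, sources) tuple and greedily starts a new group whenever an instruction reads or writes a register that was already written in the current group.

```python
def split_into_groups(instructions):
    """Greedily split a list of (dest, sources) tuples into instruction groups.

    A new group starts whenever an instruction reads or writes a register
    already written in the current group (RAW or WAW dependency).
    """
    groups, current, written = [], [], set()
    for dest, sources in instructions:
        conflict = dest in written or any(s in written for s in sources)
        if conflict and current:
            groups.append(current)          # close the group (a ";;" stop)
            current, written = [], set()
        current.append((dest, sources))
        written.add(dest)
    if current:
        groups.append(current)
    return groups


# The or-tree reduction from the text, written as (dest, sources) tuples.
or_tree = [
    ("r1", ("r10", "r11")),
    ("r2", ("r12", "r13")),
    ("r3", ("r1", "r2")),
]
groups = split_into_groups(or_tree)
```

For the or-tree example, the first two or operations land in one group and the final or, which reads r1 and r2, is pushed into the next group.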

Bits 127-87: Instruction slot 2 (41 bits)
Bits 86-46: Instruction slot 1 (41 bits)
Bits 45-5: Instruction slot 0 (41 bits)
Bits 4-0: Template (5 bits)

Figure 7.3 Itanium instruction bundle format.

An advantage of instruction groups is that they reduce the need to optimize the code for each new microarchitecture. Processors with additional resources can take advantage of the existing ILP in the instruction group.

By means of instruction groups, compilers package instructions that can be executed in parallel. It is the compiler’s responsibility to make sure that instructions in a group do not have conflicting dependencies. Armed with this information, instructions in a group are bundled together as shown in Figure 7.3. Three instructions are collected into 128-bit, aligned containers called bundles. Each bundle contains three 41-bit instruction slots and a 5-bit template field.
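The bundle layout can be made concrete with a small Python sketch that packs three 41-bit slots and a 5-bit template into one 128-bit integer, using the bit positions of Figure 7.3 (template in bits 0 to 4, slot 0 in bits 5 to 45, slot 1 in bits 46 to 86, slot 2 in bits 87 to 127). The function names pack_bundle and unpack_bundle are made up for this example, and the slot contents are treated as opaque bit patterns.

```python
SLOT_BITS = 41
TEMPLATE_BITS = 5
SLOT_MASK = (1 << SLOT_BITS) - 1
TEMPLATE_MASK = (1 << TEMPLATE_BITS) - 1


def pack_bundle(template: int, slot0: int, slot1: int, slot2: int) -> int:
    """Pack a template and three instruction slots into one 128-bit integer."""
    if template & ~TEMPLATE_MASK:
        raise ValueError("template must fit in 5 bits")
    for slot in (slot0, slot1, slot2):
        if slot & ~SLOT_MASK:
            raise ValueError("each slot must fit in 41 bits")
    return (template
            | (slot0 << 5)     # bits 5..45
            | (slot1 << 46)    # bits 46..86
            | (slot2 << 87))   # bits 87..127


def unpack_bundle(bundle: int):
    """Return (template, slot0, slot1, slot2) from a 128-bit bundle value."""
    return (bundle & TEMPLATE_MASK,
            (bundle >> 5) & SLOT_MASK,
            (bundle >> 46) & SLOT_MASK,
            (bundle >> 87) & SLOT_MASK)
```

The widths add up as 3 x 41 + 5 = 128, so the packed value always fits in the 128-bit container.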

The main purpose of the template field is to specify mapping of instruction slots to execution instruction types. Instructions are categorized into six instruction types: integer ALU, non-ALU integer, memory, floating-point, branch, and extended. A specific execution unit may execute each type of instruction. For example, floating-point instructions are executed by the F-unit, branch instructions by the B-unit, and memory instructions such as load and store by the M-unit. The remaining three types of instructions are executed by the I-unit. All instructions, except extended instructions, occupy one instruction slot. Extended instructions, which use long immediate integers, occupy two instruction slots.

Instruction Set

As in the other chapters, we discuss several sample groups of instructions from the Itanium instruction set.

Data Transfer Instructions

The Itanium’s load and store instructions are more complex than those in a typical RISC processor. The Itanium supports speculative loads to mask high latency associated with reading data from memory.

The basic load instruction takes one of the three forms shown below depending on the addressing mode used:

(qp) ldSZ.ldtype.ldhint r1 = [r3]          /* No update form */
(qp) ldSZ.ldtype.ldhint r1 = [r3],r2       /* Update form 1 */
(qp) ldSZ.ldtype.ldhint r1 = [r3],imm9     /* Update form 2 */

The load instruction loads SZ bytes from memory, starting at the effective address. The SZ completer can be 1, 2, 4, or 8 to load 1, 2, 4, or 8 bytes. In the first load instruction, register r3 provides the address. In the second instruction, r3 again provides the address of the access, and the contents of r2 are then added to r3. The third form uses a 9-bit signed immediate value instead of register r2. In the last two forms, the computed sum is stored back in r3, so r3 holds the updated address after the load.
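The three load forms can be modeled in a few lines of Python. This is a simplified sketch under stated assumptions: memory is a plain bytes object, registers live in a dictionary, byte order is little-endian, narrow loads are zero-extended, and the base update is applied after the access (post-increment), which is how the Itanium architecture documents the update forms. The function name load and its parameters are invented for the example.

```python
def load(memory: bytes, regs: dict, r1: str, r3: str, size: int, update=None) -> None:
    """Model ldSZ r1 = [r3] with an optional base update.

    `update` is None (no update form), a register name (update form 1),
    or an int in the signed 9-bit range (update form 2).
    Assumes little-endian byte order and a post-increment update.
    """
    if size not in (1, 2, 4, 8):
        raise ValueError("SZ must be 1, 2, 4, or 8")
    addr = regs[r3]                                  # address comes from r3
    value = int.from_bytes(memory[addr:addr + size], "little")
    if update is not None:
        inc = regs[update] if isinstance(update, str) else update
        if not isinstance(update, str) and not -256 <= inc <= 255:
            raise ValueError("imm9 out of range")
        regs[r3] = (addr + inc) & (2**64 - 1)        # base register update
    regs[r1] = value                                 # zero-extended result
```

A typical use of the update forms is walking an array: each load reads the current element and leaves r3 pointing at the next one.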

The ldtype completer can be used to specify special load operations. For normal loads, the completer is not specified. For example, the instruction

ld8 r5 = [r6]

loads eight bytes from the memory starting from the effective address in r6. As mentioned before, the Itanium supports speculative loads. Two example instructions are shown below:

ld8.a r5 = [r6]    /* advanced load */
ld8.s r5 = [r6]    /* speculative load */

We defer a discussion of these load instruction types to a later section that discusses the speculative execution model of Itanium.

The ldhint completer specifies the locality of the memory access. It can take one of the following three values.

ldhint    Interpretation
None      Temporal locality, level 1
nt1       No temporal locality, level 1
nta       No temporal locality, all levels

 

 

A prefetch hint is implied in the two “update” forms of load instructions. The address in r3 after the update acts as a hint to prefetch the indicated cache line. In the “no update” form of load, r3 is not updated and no prefetch hint is implied. Level 1 refers to the cache level. Because we don’t cover temporal locality and cache memory in this book, we refer the reader to [6] for details on cache memory. It is sufficient to view the ldhint completer as giving a hint to the processor as to whether a prefetch is beneficial.

The store instruction is simpler than the load instruction. There are two types of store instructions, corresponding to the two addressing modes, as shown below:

(qp) stSZ.sttype.sthint [r3] = r2          /* No update form */
(qp) stSZ.sttype.sthint [r3] = r2,imm9     /* Update form */

The SZ completer can have four values as in the load instruction. The sttype can be none or rel. If the rel value is specified, an ordered store is performed. The sthint gives a prefetch hint as in the load instruction. However, it can be either none or nta. When no value is specified, temporal locality at level 1 is assumed. The nta has the same interpretation as in the load instruction.


The Itanium also has several move instructions to copy data into registers. We describe three of these instructions:

(qp) mov r1 = r3
(qp) mov r1 = imm22
(qp) movl r1 = imm64

These instructions move the second operand into the r1 register. The first two mov instructions are actually pseudoinstructions. That is, these instructions are implemented using other processor instructions. The movl is the only instruction that requires two instruction slots within the same bundle.

Arithmetic Instructions

The Itanium provides only the basic integer arithmetic operations: addition, subtraction, and multiplication. There is no divide instruction, either for integers or floating-point numbers. Division is implemented in software. Let’s start our discussion with the add instructions.

Add Instructions The format of the add instructions is given below:

(qp) add r1 = r2,r3      /* register form */
(qp) add r1 = r2,r3,1    /* plus 1 form */
(qp) add r1 = imm,r3     /* immediate form */

In the plus 1 form, the constant 1 is added as well. In the immediate form, imm can be a 14- or 22-bit signed value. If we use a 22-bit immediate value, r3 can only be one of the first four general registers GR0 through GR3 (i.e., only 2 bits are used to specify the second operand register as shown in Figure 7.2).

The immediate form is a pseudoinstruction that selects one of the two processor immediate add instructions,

(qp) add r1 = imm14,r3
(qp) add r1 = imm22,r3

depending on the size of the immediate operand and the value of r3. The move instruction

(qp) mov r1 = r3

is implemented as

(qp) add r1 = 0,r3

The move instruction

(qp) mov r1 = imm22

is implemented as

(qp) add r1 = imm22,r0

Remember that r0 is hardwired to value zero.
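The way an assembler might expand these pseudoinstructions can be sketched in Python. The selection rule used here (prefer the 14-bit form whenever the value fits, otherwise fall back to the 22-bit form if the source register is r0 through r3) is an assumption for illustration; the text only says that the choice depends on the immediate size and on r3. The function names and the tuple output format are hypothetical.

```python
def fits_signed(value: int, bits: int) -> bool:
    """True if value is representable as a signed two's-complement field."""
    return -(1 << (bits - 1)) <= value < (1 << (bits - 1))


def expand_add_imm(r1: str, imm: int, r3: str):
    """Choose between the 14-bit and 22-bit immediate add encodings."""
    if fits_signed(imm, 14):
        return ("imm14", r1, imm, r3)
    if fits_signed(imm, 22) and r3 in ("r0", "r1", "r2", "r3"):
        return ("imm22", r1, imm, r3)   # only 2 bits are available for r3
    raise ValueError("immediate cannot be encoded with this source register")


def expand_mov(r1: str, src):
    """Expand the mov pseudoinstruction into an immediate add."""
    if isinstance(src, str):                 # mov r1 = r3    ->  add r1 = 0,r3
        return expand_add_imm(r1, 0, src)
    return expand_add_imm(r1, src, "r0")     # mov r1 = imm22 ->  add r1 = imm22,r0
```

Because r0 always reads as zero and is one of the four registers the 22-bit form can name, any 22-bit constant can be moved into a register this way.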


Subtract Instructions The subtract instruction sub has the same format as the add instruction. The contents of register r3 are subtracted from the contents of r2. In the minus 1 form, the constant 1 is also subtracted. In the immediate form, imm is restricted to an 8-bit value.

The instruction shladd (shift left and add)

(qp) shladd r1 = r2,count,r3

is similar to the add instruction, except that the contents of r2 are left-shifted by count bit positions before adding. The count operand is a 2-bit value, which restricts the shift to 1-, 2-, 3-, or 4-bit positions.
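A minimal Python model of shladd is shown below. It assumes 64-bit registers with results truncated to 64 bits; the function name simply mirrors the mnemonic.

```python
MASK64 = (1 << 64) - 1


def shladd(r2: int, count: int, r3: int) -> int:
    """Return (r2 << count) + r3, truncated to 64 bits."""
    if count not in (1, 2, 3, 4):
        raise ValueError("count must be 1, 2, 3, or 4")
    return ((r2 << count) + r3) & MASK64
```

One plausible use is array indexing: with 8-byte elements, shladd(i, 3, base) computes base + 8 * i in a single step.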

Multiply Instructions Integer multiply is done using the xmpy and xma instructions. These instructions do not use the general registers; instead, they use the floating-point registers.

The xmpy instruction has the following formats.

(qp) xmpy.l  f1 = f3,f4
(qp) xmpy.lu f1 = f3,f4
(qp) xmpy.h  f1 = f3,f4
(qp) xmpy.hu f1 = f3,f4

The two source operands, floating-point registers f3 and f4, are treated either as signed or unsigned integers. The completer u in the second and fourth instructions specifies that the operands are unsigned integers. The other two instructions treat the two integers as signed. The l or h indicate whether the lower or higher 64 bits of the result should be stored in the f1 floating-point register.

The xmpy instruction multiplies the two integers in f3 and f4 and places the lower or upper 64-bit result in the f1 register. Note that we get a 128-bit result when we multiply two 64-bit integers.

The xma instruction has four formats as does the xmpy instruction, as shown below:

(qp) xma.l  f1 = f3,f4,f2
(qp) xma.lu f1 = f3,f4,f2
(qp) xma.h  f1 = f3,f4,f2
(qp) xma.hu f1 = f3,f4,f2

This instruction multiplies the two 64-bit integers in f3 and f4 and adds the zero-extended 64-bit value in f2 to the product.
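The following Python sketch models the integer multiply-add on plain 64-bit integers; it ignores the floating-point register format entirely and only illustrates how the low or high half of the 128-bit result is selected. The helper names and the keyword arguments high and unsigned are invented for the example. Passing f2 = 0 gives the plain multiply described for xmpy.

```python
MASK64 = (1 << 64) - 1


def to_signed64(x: int) -> int:
    """Interpret a 64-bit pattern as a signed integer."""
    x &= MASK64
    return x - (1 << 64) if x >> 63 else x


def xma(f3: int, f4: int, f2: int = 0, high: bool = False, unsigned: bool = False) -> int:
    """Model xma.l / xma.lu / xma.h / xma.hu on 64-bit integer patterns."""
    if unsigned:
        a, b = f3 & MASK64, f4 & MASK64
    else:
        a, b = to_signed64(f3), to_signed64(f4)
    total = a * b + (f2 & MASK64)        # f2 is zero-extended before the add
    total &= (1 << 128) - 1              # keep the 128-bit result
    return (total >> 64) & MASK64 if high else total & MASK64
```

The signed and unsigned variants agree on the low 64 bits and differ only in the high half, which is why the u completer matters mainly for the h forms.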

Logical Instructions

Logical operations and, or, and xor are supported by three logical instructions. There is no not instruction. However, the Itanium has an and-complement (andcm) instruction that complements one of the operands before performing the bitwise-and operation.

All instructions have the same format. We illustrate the format of these instructions for the and instruction:


(qp) and r1 = r2,r3
(qp) and r1 = imm8,r3

The other three operations use the mnemonics or, xor, and andcm. The and-complement instruction complements the contents of r3 and ands it with the first operand (contents of r2 or immediate value imm8).

Shift Instructions

Both left- and right-shift instructions are available. The shift instructions

(qp) shl r1 = r2,r3
(qp) shl r1 = r2,count

left-shift the contents of r2 by the count value specified by the second operand. The count value can be specified in r3 or given as a 6-bit immediate value. If the count value in r3 is more than 63, the result is all zeros.

Right-shift instructions use a similar format. Because right-shift can be arithmetical or logical depending on whether the number is signed or unsigned, two versions are available. The register versions of the right-shift instructions are shown below:

(qp) shr   r1 = r2,r3    (signed right shift)
(qp) shr.u r1 = r2,r3    (unsigned right shift)

In the second instruction, the completer u is used to indicate the unsigned shift operation. We can also use a 6-bit immediate value for shift count as in the shl instruction.
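The difference between the shift variants is easy to see in a small Python model operating on 64-bit patterns. The treatment of counts above 63 for the right shifts (zero for shr.u, all sign bits for shr) is an assumption made by analogy with the left-shift rule stated in the text; the function names mirror the mnemonics.

```python
MASK64 = (1 << 64) - 1


def shl(r2: int, count: int) -> int:
    """Left shift; a count above 63 yields zero."""
    return 0 if count > 63 else (r2 << count) & MASK64


def shr_u(r2: int, count: int) -> int:
    """Unsigned (logical) right shift: vacated bits are filled with zeros."""
    return 0 if count > 63 else (r2 & MASK64) >> count


def shr(r2: int, count: int) -> int:
    """Signed (arithmetic) right shift: vacated bits copy the sign bit."""
    value = r2 & MASK64
    if value >> 63:
        value -= 1 << 64                # reinterpret as a negative number
    return (value >> min(count, 63)) & MASK64
```

For a value with the top bit set, shr_u brings in zeros while shr replicates the sign bit, so the two results differ in every vacated position.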

Comparison Instructions

The compare instruction uses two completers as shown below:

(qp) cmp.crel.ctype p1,p2 = r2,r3
(qp) cmp.crel.ctype p1,p2 = imm8,r3

The two source operands are compared and the result is written to the two specified destination predicate registers. The type of comparison is specified by crel. We can specify one of 10 relations for signed and unsigned numbers. The relations “equal” (eq) and “not equal” (ne) are valid for both signed and unsigned numbers. For signed numbers, there are 4 relations to test for “<” (lt), “≤” (le), “>” (gt), and “≥” (ge). The corresponding relations for testing unsigned numbers are ltu, leu, gtu, and geu. The relation is tested as “r2 rel r3”.

The ctype completer specifies how the two predicate registers are to be updated. The normal type (default) writes the comparison result in the p1 register and its complement in the p2 register. This would allow us to select one of the two branches (we show an example on page 113). The ctype completer allows specification of other types such as and and or. If or is specified, both p1 and p2 are set to 1 only if the comparison result is 1; otherwise, the two predicate registers are not altered. This is useful for implementing or-type simultaneous execution. Similarly, if and is specified, both registers are set to 0 if the comparison result is 0 (useful for and-type simultaneous execution).
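The three update rules described above can be written down directly. The Python sketch below takes the boolean outcome of the comparison and the current values of the two predicate registers, and returns their new values; the function name cmp_update and the string "normal" for the default type are choices made for this example.

```python
def cmp_update(result: bool, p1: int, p2: int, ctype: str = "normal"):
    """Return the new (p1, p2) pair after a compare with the given ctype."""
    if ctype == "normal":
        return int(result), int(not result)      # result and its complement
    if ctype == "or":
        return (1, 1) if result else (p1, p2)    # set only when result is 1
    if ctype == "and":
        return (p1, p2) if result else (0, 0)    # clear only when result is 0
    raise ValueError("unsupported ctype")
```

With the or type, several compares can target the same predicates after they have been initialized to 0: any true comparison sets them, and false comparisons leave them alone, so the order of the compares does not matter.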


Branch Instructions

As in the other architectures, the Itanium uses the branch instruction for traditional jumps as well as for procedure call and return. The generic branch is supplemented by a completer to specify the type of branch. The branch instruction supports both direct and indirect branching. All direct branches are IP relative (i.e., PC relative). Some sample branch instruction formats are shown below:

IP Relative Form:
(qp) br.btype.bwh.ph.dh target25          (Basic form)
(qp) br.btype.bwh.ph.dh b1 = target25     (Call form)
     br.btype.bwh.ph.dh target25          (Counted loop form)

Indirect Form:
(qp) br.btype.bwh.ph.dh b2                (Basic form)
(qp) br.btype.bwh.ph.dh b1 = b2           (Call form)

As can be seen, branch uses up to four completers. The btype specifies the type of branch. The other three completers provide hints and are discussed later.

For the basic branch, btype can be either cond or none. In this case, the branch is taken if the qualifying predicate is 1; otherwise, the branch is not taken. The IP-relative target address is given as a label in the assembly language. The assembler translates this into a signed 21-bit value that gives the difference between the target bundle and the bundle containing the branch instruction. The target pointer is to a bundle of 128 bits (16 bytes), therefore the value (target25 − IP) is shifted right by 4 bit positions to get a 21-bit value. Note that the format shown in Figure 7.2d uses a 21-bit displacement value.
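The displacement computation can be sketched in Python as follows. The function names are hypothetical, and the split of the 21-bit value into the S bit (top bit) and the imm20 field (low 20 bits) is an assumption based on the field names in Figure 7.2d.

```python
def encode_displacement(target: int, ip: int) -> int:
    """Return the signed 21-bit bundle displacement for an IP-relative branch."""
    if target % 16 or ip % 16:
        raise ValueError("bundle addresses must be 16-byte aligned")
    disp = (target - ip) >> 4          # difference in 128-bit bundles
    if not -(1 << 20) <= disp < (1 << 20):
        raise ValueError("target is out of range for a 21-bit displacement")
    return disp


def split_displacement(disp: int):
    """Split the displacement into the S bit and the imm20 field."""
    raw = disp & ((1 << 21) - 1)       # 21-bit two's-complement pattern
    return raw >> 20, raw & ((1 << 20) - 1)
```

Since the 21-bit value counts 16-byte bundles, the reachable range is 2^20 bundles in either direction, which corresponds to a 25-bit byte offset.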

To invoke a procedure, we use the second form and specify call for btype. This turns the branch instruction into a conditional call instruction. The procedure is invoked only if the qualifying predicate is true. As part of the call, it places the current frame marker and other relevant state information in the previous function state application register. The return link value is saved in the b1 branch register for use by the return instruction.

There is also an unconditional (no qualifying predicate) counted loop version. In this branch instruction (the third one), btype is set to cloop. If the Loop Count (LC) application register ar65 is not zero, it is decremented and the branch is taken.
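The counted-loop rule can be traced with a short Python model. It assumes the usual loop shape in which br.cloop sits at the bottom of the loop body, and it follows the rule exactly as stated: if LC is nonzero, decrement it and branch back; otherwise fall through. The function name run_counted_loop is invented for this sketch.

```python
def run_counted_loop(lc: int) -> int:
    """Count how many times the loop body runs for an initial LC value."""
    executions = 0
    while True:
        executions += 1        # loop body
        if lc != 0:            # br.cloop: decrement and branch while LC != 0
            lc -= 1
            continue
        break                  # LC == 0: fall through
    return executions
```

Under this model the body runs LC + 1 times, because it executes once before br.cloop tests LC for the first time.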

We can use ret as the branch type to return from a procedure. It should use the indirect form and specify the branch register in which the call has placed the return pointer. In the indirect form, a branch register specifies the target address. The return restores the caller’s stack frame and privilege level.

The last instruction can be used for an indirect procedure call. In this branch instruction, the b2 branch register specifies the target address and the return address is placed in the b1 branch register.

Let us look at some examples of branch instructions. The instruction


(p3) br skip

or

(p3) br.cond skip

transfers control to the instruction labeled skip, if the predicate register p3 is 1. The code sequence

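mov lc = 100
loop_back:
. . .
br.cloop loop_back

executes the loop body 101 times: the body runs once before br.cloop first tests LC, and the branch is then taken 100 times as LC counts down from 100 to 0. To execute the body exactly 100 times, LC would be initialized to 99. A procedure call may look like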

(p0) br.call br2 = sum

whereas the return from procedure sum uses the indirect form

(p0) br.ret br2

Because we are using predicate register 0, which is hardwired to 1, both the call and return become unconditional.

The bwh (branch whether hint) completer can be used to convey whether the branch is taken (see page 119). The ph (prefetch hint) completer gives a hint about sequential prefetch. It can take either few or many. If the value is few or none, few lines are prefetched; many lines are prefetched when many is specified. The two levels—few and many—are system defined. The final completer dh (deallocation hint) specifies whether the branch cache should be cleared. The value clr indicates deallocation of branch information.

Handling Branches

Pipelining works best when we have a linear sequence of instructions. Branches cause pipeline stalls, leading to performance problems. How do we minimize the adverse effects of branches? There are three techniques to handle this problem.

Branch Elimination: The best solution is to avoid the problem in the first place. This argument may seem strange as programs contain lots of branch instructions. Although we cannot eliminate all branches, we can eliminate certain types of branches. This elimination cannot be done without support at the instruction-set level. We look at how the Itanium uses predication to eliminate some types of branches.

Branch Speedup: If we cannot eliminate a branch, at least we can reduce the amount of delay associated with it. This technique involves reordering instructions so that instructions that are not dependent on the branch/condition can be executed while the branch instruction is processed. Speculative execution can be used to reduce branch delays. We describe the Itanium’s speculative execution strategies later.
