Добавил:
Upload Опубликованный материал нарушает ваши авторские права? Сообщите нам.
Вуз: Предмет: Файл:

MIPS_primery_zadach / dandamudi05gtr guide risc processors programmers engineers

.pdf
Скачиваний:
65
Добавлен:
11.05.2015
Размер:
1.39 Mб
Скачать

Chapter 7 Itanium Architecture

113

Branch Prediction: If we can predict whether the branch will be taken, we can load the pipeline with the right sequence of instructions. Even if we predict correctly all the time, it would only convert a conditional branch into an unconditional branch. We still have the problems associated with unconditional branches. We described three types of branch prediction strategies in Chapter 2 (see page 28).

Inasmuch as we covered branch prediction in Chapter 2, we discuss the first two techniques next.

Predication to Eliminate Branches In the Itanium, branch elimination is achieved by a technique known as predication. The trick is to make execution of each instruction conditional. Thus, unlike the instructions we have seen so far, an instruction is not automatically executed when the control is transferred to it. Instead, it will be executed only if a condition is true. This requires us to associate a predicate with each instruction. If the associated predicate is true, the instruction is executed; otherwise, it is treated as a nop instruction. The Itanium architecture supports full predication to minimize branches. Most of the Itanium’s instructions can be predicated.

To see how predication eliminates branches, let us look at the following example.

if (R1

== R2)

cmp

r1,r2

R3

= R3

+ R1;

je

then_part

else

 

 

sub

r3,r1

R3

= R3

- R1;

jmp

end_if

 

 

 

then_part:

 

 

 

 

add

r3,r1

 

 

 

end_if:

 

The code on the left-hand side, expressed in C, is a simple if-then-else statement. The IA32 assembly language equivalent is shown on the right. The cmp instruction compares the contents of the r1 and r2 registers and sets the condition code bits (in the IA-32, these are called the flag bits). If r1 = r2, the jump equal (je) instruction transfers control to the then_part. Otherwise, the sub instruction is executed (i.e., the else_part). Then the unconditional jump (jmp) instruction transfers control to end_if. As you can see from this code, it introduces two branches: unconditional (jmp) and conditional (je).

Using the Itanium’s predication, we can express the same as

cmp.eq p1,p2 = r1,r2 (p1) add r3 = r3,r1

(p2) sub r3 = r3,r1

The compare instruction sets two predicates after comparing the contents of r1 and r2 for equality. The result of this comparison is placed in p1 and its complement in p2. Thus, if the contents of r1 and r2 are equal, p1 is set to 1 (true) and p2 to 0 (false). Because the add instruction is predicated on p1, it is executed only if p1 is true. It should be

114

Guide to RISC Processors

clear that either the add or the sub instruction is executed, depending on the comparison result.

To illustrate the efficacy of predicated execution, we look at the following switch statement in C.

switch (r6)

{

case 1:

r2 = r3 + r4; break;

case 2:

r2 = r3 - r4; break;

case 3:

r2 = r3 + r5; break;

case 4:

r2 = r3 - r5; break;

}

For simplicity, we are using the register names in the switch statement. Translating this statement would normally involve a series of compare and branch instructions. Predication avoids this sequence as shown below:

cmp.eq p1,p0 = r6,1 cmp.eq p2,p0 = r6,2 cmp.eq p3,p0 = r6,3 cmp.eq p4,p0 = r6,4 ;;

(p1) add r2 = r3,r4

(p2) sub r2 = r3,r4

(p3) add r2 = r3,r5

(p4) sub r2 = r3,r5

In the first group of instructions, the four compare instructions set p1/p2/p3/p4 if the corresponding comparison succeeds. If the processor has resources, all four instructions can be executed concurrently. Because p0 is hardwired to 1, failure conditions are ignored in the above code. Depending on the compare instruction that succeeds, only one of the four arithmetic instructions in the second group is executed.

Speculative Execution

Speculative execution refers to the scenario where instructions are executed in the expectation that they will be needed in actual program execution. The main motivation, of course, is to improve performance. There are two main reasons to speculatively execute

Chapter 7 Itanium Architecture

115

instructions: to keep the pipeline full and to mask memory access latency. We discuss two types of speculative execution supported by the Itanium: one type handles data dependencies, and the other deals with control dependencies. Both techniques are compiler optimizations that allow the compiler to reorder instructions. For example, we can speculatively move high-latency load instructions earlier so that the data are available when they are actually needed.

Data Speculation

Data speculation allows the compiler to schedule instructions across some types of ambiguous data dependencies. When two instructions access common resources (either registers or memory locations) in a conflicting mode, data dependency exists. A conflicting access is one in which one or both instructions alter the data. Depending on the type of conflicting access, we can define the following dependencies.

Read-After-Write (RAW): This dependency exists between two instructions if one instruction writes into a register or a memory location that is later read by the other instruction.

Write-After-Read (WAR): This dependency exists between two instructions if one instruction reads from a register or a memory location that is later written by the other instruction.

Write-After-Write (WAW): This dependency exists between two instructions if one instruction writes into a register or a memory location that is later written by the other instruction.

Ambiguous: Ambiguous data dependency exists when pointers are used to access memory. Typically, in this case, dependencies between load and store instructions or store and store instructions cannot be resolved statically at compile time because we don’t know the pointer values. Handling this type of data dependency requires run-time support.

There is no conflict in allowing read-after-read (RAR) access. The first three dependencies are not ambiguous in the sense that the dependency type can be statically determined at compile/assembly time. The compiler or programmer should insert stops (;;) so that the dependencies are properly maintained. The example

sub r6=r7,r8 ;;

add r9=r10,r6

exhibits a RAW data dependency on r6. The stop after the sub instruction would allow the add instruction to read the value written by the sub instruction.

If there is no data dependency, the compiler can reorder instructions to optimize the code. Let us look at the following example:

sub

r6=r7,r8 ;;

// cycle 1

116

Guide to RISC Processors

sub

r9=r10,r6

//

cycle

2

ld8

r4=[r5] ;;

 

 

 

add

r11=r12,r4

//

cycle

4

Because there is a two-cycle latency to the first-level data cache, the add instruction is scheduled two cycles after scheduling the ld8 instruction. A straightforward optimization involves moving the ld8 instruction to cycle 1 as there are no data dependencies to prevent this reordering. By advancing the load instruction, we can schedule the add in cycle 3 as shown below:

ld8

r4=[r5]

// cycle 1

sub

r6=r7,r8 ;;

 

 

 

sub

r9=r10,r6 ;;

//

cycle

2

add

r11=r12,r4

//

cycle

3

However, when there is ambiguous data dependency, as in the following example, instruction reordering may not be possible:

sub

r6=r7,r8 ;;

// cycle 1

st8

[r9]=r6

// cycle 2

ld8

r4=[r5] ;;

 

 

 

add

r11=r12,r4 ;;

//

cycle

4

st8

[r10]=r11

//

cycle

5

In this code, ambiguous dependency exists between the first st8 and ld8 because r9 and r5 could be pointing to overlapped memory locations. Remember that each of these instructions accesses eight contiguous memory locations. This ambiguous dependency will not allow us to move the load instruction to cycle 1 as in the previous example.

The Itanium provides architectural support to move such load instructions. This is facilitated by advance load (ld.a) and check load (ld.c). The basic idea is that we initiate the load early and when it is time to actually execute the load, we will make a check to see if there is a dependency that invalidates our advance load data. If so, we reload; otherwise, we successfully advance the load instruction even when there is ambiguous dependency. The previous example with advance and check loads is shown below:

1:

ld8.a

r4=[r5]

//

cycle

0

or earlier

 

. . .

 

 

 

 

2:

sub

r6=r7,r8 ;;

//

cycle

1

 

Chapter 7 Itanium Architecture

117

3: st8

[r9]=r6

// cycle 2

4:ld8.c r4=[r5]

5:

add

r11=r12,r4 ;;

 

6:

st8

[r10]=r11

// cycle 3

We inserted an advance load (line 1) at cycle 0 or earlier so that r4 would have the value ready for the add instruction on line 5 in cycle 2. However, we have to check to see if we can use this value. This check is done by the check load instruction ld8.c on line 4. If there is no dependency between the store on line 3 and load on line 4, we can safely use the prefetched value. This is the case if the pointers in r9 and r5 are different. On the other hand, if the load instruction is reading the value written by the store instruction, we have to reload the value. The check load instruction on line 4 automatically reloads the value in the case of a conflict.

In the last example, we advanced just the load instruction. However, we can improve performance further if we can also advance all (or some of) the statements that depend on the value read by the load instruction. In our example, it would be nice if we could advance the add instruction on line 5. This causes a problem if there is a dependency between the store and load instructions (on lines 3 and 4). In this case, we not only have to reexecute the load but also the add instruction. The advance check (chk.a) instruction provides the necessary support for such reexecution as shown in the following example.

ld8.a

r4=[r5]

// cycle -1 or earlier

. . .

 

add

r11=r12,r4

// cycle 1

sub

r6=r7,r8 ;;

 

st8

[r9]=r6

// cycle 2

chk.a

r4,recover

 

back:

 

 

st8

[r10]=r11

 

recover:

 

 

ld8

r4=[r5]

// reload

add

r11=r12,r4

// reexecute add

br

back

// jump back

When the advanced load fails, the check instruction transfers control to recover to reload and reexecute all the instructions that used the value provided by the advanced load.

How does the Itanium maintain the dependency information for use by the check instructions? It keeps a hardware structure called the Advanced Load Address Table (ALAT), indexed by the physical register number. When an advanced load (ld.a) instruction is executed, it records the load address. When a check is executed (either ld.c

118

Guide to RISC Processors

or chk.a), it checks ALAT for the address. The check instruction must specify the same register that the advanced load instruction used. If the address is present in ALAT, execution continues. Otherwise, a reload (in the case of ld.c) or recovery code (in the case of chk.a) is executed. An entry in ALAT can be removed, for example, by a subsequent store that overlaps the load address. To determine this overlap, the size of the load in bytes is also maintained.

Control Speculation

When we want to reduce latencies of long latency instructions such as load, we advance them earlier into the code. When there is a branch instruction, it blocks such a move because we do not know whether the branch will be taken until we execute the branch instruction. Let us look at the following code fragment.

cmp.eq

 

p1,p0 = r10,10

// cycle 0

(p1) br.cond

skip ;;

// cycle 0

ld8

r1

= [r2] ;;

// cycle

1

add

r3

= r1,r4

// cycle

3

skip:

 

 

 

 

// other instructions

In the above code, we cannot advance the load instruction due to the branch instruction. Execution of the then branch takes four clock cycles. Note that the code implies that the integer compare instruction that sets the predicate register p1 and the branch instruction that tests it are executed in the same cycle. This is the only exception; in general, there should be a stop inserted between an instruction that is setting a predicate register and the subsequent instruction testing the same predicate.

Now the question is: how do we advance the load instruction past the branch instruction? We speculate in a manner similar to the way we handled data dependency. There is one additional problem: because the execution may not take the branch, if the speculative execution causes exceptions, they should not be raised. Instead, exceptions should be deferred until we know that the instruction will indeed be executed. The Not-a-Thing bit is used for this purpose. If a speculative execution causes an exception, the NaT bit associated with that register is set. When a check is made at the point of actual instruction execution, if the deferred exception is not present, speculative execution was successful. Otherwise, the speculative execution should be redone. Let us rewrite the previous code with a speculative load.

ld8.s

r1 = [r2] ;;

// cycle -2 or earlier

// other instructions

 

cmp.eq

p1,p0 = r10,10

// cycle 0

(p1) br.cond

skip

// cycle 0

chk.s

r1, recovery

// cycle 0

add r3 = r1,r4

// cycle 0

Chapter 7 Itanium Architecture

119

skip:

// other instructions

recovery:

ld8

r1 = [r2]

br

skip

The load instruction is moved by at least two cycles so that the value is available in r1 by the time it is needed by the add instruction. Because this is a speculative load, we use the ld8.s instruction. And, in place of the original load instruction, we insert a speculative check (chk.s) instruction with the same register r1. As in the data dependency example, if the speculative load is not successful (i.e., the NaT bit of r1 is set), the instruction has to be reexecuted. In this case, the check instruction branches to the recovery code.

Branch Prediction Hints

Most processors use branch prediction to handle the branch problem. We have discussed three branch prediction strategies—fixed, static, and dynamic—in Chapter 2. Branch prediction strategies can accept hints from the compiler. In the Itanium, branch hints can be explicitly provided in branch instructions. The bwh completer can take one of the following four values.

spnt Static branch not taken sptk Static branch taken

dpnt Dynamic branch not taken dpnt Dynamic branch taken

The SPARC processor also allows providing hints in its branch instructions (see Chapter 5 for details).

Summary

We have given a detailed overview of the Itanium architecture. The Itanium is a more advanced processor in terms of the features provided. Compared to the other processors we have discussed so far, it supports explicit parallelism by providing ample resources such as registers and functional units. In addition, the Itanium architecture supports sophisticated speculative loads and predication. It also uses traditional branch prediction strategies. The Itanium, even though it follows the RISC principles mentioned in Chapter 3, is not a “simple” architecture. Some of its instructions truly belong to the CISC category.

Web Resources

Certainly, we have left out several elements. If you are interested in learning more about this architecture, visit the Intel Web site for recent information. The Itanium architec-

120

Guide to RISC Processors

ture details and manuals are available from www.intel.com/design/Itanium/ manuals/. The architecture is described in three volumes [10, 11, 12].

8

ARM Architecture

In this chapter, we take a detailed look at the ARM architecture. Unlike the other architectures we have seen so far, ARM provides only 16 general-purpose registers. Even though it follows the RISC principles, it provides a large number of addressing modes. Furthermore, it has several complex instructions. The format of this chapter is similar to that we used in the last few chapters. We start with a brief background on the ARM architecture. We then present details on its registers and addressing modes. Following this, we give details on some sample ARM instructions. We conclude the chapter with a summary.

Introduction

The ARM architecture was developed in 1985 by Acorn Computer Group in the United Kingdom. Acorn introduced the first RISC processor in 1987, targeting low-cost PCs. In 1990, Acorn formed Advanced RISC Machines. ARM, which initially stood for Acorn RISC Machine but later changed to Advanced RISC Machine, defines a 32-bit RISC architecture.

ARM’s instruction set architecture has evolved over time. The first two versions had only 26-bit address space. Version 3 extended it to 32 bits. This version also introduced separate program status registers that we discuss in the next section. Version 4, ARMv4, is the oldest version supported today. Some implementations of this version include the ARM7™ core family and Intel StrongARM™ processors.

ARM architecture has been extended to support cost-sensitive embedded applications such as cell phones, modems, and pagers by introducing the Thumb instruction set. The Thumb instruction set is a subset of the most commonly used 32-bit ARM instructions. To reduce the memory requirements, the instructions are compressed into 16-bit wide encodings. To execute these 16-bit instructions, they are decompressed transparently to full 32-bit ARM instructions in real-time. The ARMv4T architecture has added these 16-bit Thumb instructions to produce compact code for handheld devices.

121

122

 

 

 

 

 

 

 

Guide to RISC Processors

 

 

 

 

 

General−Purpose Registers

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

PC

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

R14

 

R14_fiq

 

R14_irq

 

R14_svc

 

 

R13_abt

 

 

R13_und

 

 

R13

 

R13_fiq

 

R13_irq

 

R13_svc

 

 

R13_abt

 

 

R13_und

 

 

 

R12

 

R12_fiq

 

 

 

 

 

 

 

 

 

 

 

 

 

 

R11

 

R11_fiq

 

 

 

 

 

 

 

 

 

 

 

 

 

 

R10

 

R10_fiq

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Program Status Registers

 

 

R9

 

R9_fiq

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

R8

 

R8_fiq

 

 

 

 

 

 

Current Program Status Registers

 

 

R7

 

 

 

 

 

 

 

 

 

 

CPSR

 

 

 

 

R6

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

R5

 

 

 

 

 

 

 

 

Saved Program Status Registers

 

 

R4

 

 

 

 

 

 

 

 

 

 

SPSR_fiq

 

 

 

 

R3

 

 

 

 

 

 

 

 

 

 

SPSR_irq

 

 

 

 

R2

 

 

 

 

 

 

 

 

 

 

SPSR_svc

 

 

 

 

R1

 

 

 

 

 

 

 

 

 

 

SPSR_abt

 

 

 

 

R0

 

 

 

 

 

 

 

 

 

SPSR_und

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Figure 8.1 ARM register set consists of 31 general-purpose register and six status registers. All registers are 32 bits wide.

In 1999, the ARMv5TE architecture introduced several improvements to the Thumb architecture. In addition, several DSP instructions have been added to the ARM ISA. In 2000, the ARMv5TEJ architecture added the Jazelle extension to support Java acceleration technology for small memory footprint designs. In 2001, the ARMv6 architecture was introduced. This version provides support for multimedia instructions to execute in Single Instruction Multiple Data (SIMD) mode.

Like the MIPS, the ARM is dominant in the 32-bit embedded RISC microprocessor market. ARM is used in several portable and handheld devices including Dell’s AXIM X5, Palm Tungsten T, RIM’s Blackberry, HP iPaq, Nintendo’s Gameboy Advance, and so on [4]. The ARM Web site claims that, by 2002, they shipped over 1 billion of its microprocessor cores [3].

In the remainder of this chapter we look at the ARM architecture in detail. After reading this chapter, you will notice that this architecture is somewhat different from the other processors we have seen so far. It shares some features with the Itanium architecture discussed in the last chapter. However, you should note that the ARM architecture development started in 1985.

Соседние файлы в папке MIPS_primery_zadach