
Furber S. ARM System-on-Chip Architecture. 2000


AMULET1


Register coherency

If the execution pipeline is to work efficiently, the register file must be able to issue the operands for an instruction before the result of the previous instruction has returned. However, in some cases an operand may depend on a preceding result (this is a read-after-write hazard), in which case the register file cannot issue the operand until the result has returned unless there is a forwarding mechanism to supply the correct value further down the pipeline.

The register forwarding mechanism used on many RISC processors (including ARM8 and StrongARM) is based upon the characteristics of a synchronous pipeline since it involves comparing the source operand register number in one pipeline stage with the destination register number in another stage. In an asynchronous pipeline the stages are all moving at different times and such a comparison can only be made by introducing explicit synchronization between the stages concerned, thereby losing most of the benefits of asynchronous operation.

Register locking

On AMULET1 register coherency is achieved through a novel form of register locking, based on a register lock FIFO (first-in-first-out queue). The destination register numbers are stored, in decoded form, in a FIFO, until the associated result is returned from the execution or memory pipeline to the register bank.

The organization of the lock FIFO is illustrated in Figure 14.4. Each stage of the FIFO holds a '1' in the position corresponding to the destination register. In this figure the FIFO controls 16 registers (in AMULET1 the FIFO has a column for each physical register, including the ARM banked registers) and is shown in a state where the first result to arrive will be written into r0, the second into r2 and the third into r12; the fourth FIFO stage is empty.

Figure 14.4 AMULET register lock FIFO organization.


The AMULET Asynchronous ARM Processors

If a subsequent instruction requests r12 as a source operand, an inspection of the FIFO column corresponding to r12 (outlined in the figure) reveals whether or not r12 is valid. A '1' anywhere in the column signifies a pending write to r12, so its current value is obsolete. The read waits until the '1' is cleared, then it can proceed.

The 'inspection' is implemented in hardware by a logical 'OR' function across the column for each register. This may appear hazardous since the data in the FIFO may move down the FIFO while the 'OR' output is being used. However, data moves in an asynchronous FIFO by being duplicated from one stage into the next; only then is it removed from the first stage. A propagating '1' will therefore appear alternately in one or two positions and will never disappear completely, so the 'OR' output remains stable even though the data is moving.
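The locking scheme described above can be sketched in software. The following Python model is only an illustration under stated assumptions: the class and function names (LockFifo, propagate) are hypothetical, the FIFO is four stages deep as in Figure 14.4, and the hardware handshakes are reduced to method calls. It shows the decoded (one-hot) storage of destination registers, the column 'OR' used to test for a pending write, and the duplicate-then-remove movement that keeps the 'OR' output stable.

```python
class LockFifo:
    """Toy model of a register lock FIFO (hypothetical names, not the AMULET1 circuit)."""

    def __init__(self, num_registers=16, depth=4):
        self.num_registers = num_registers
        # stages[0] is the output end; each stage is a one-hot mask, 0 means empty.
        self.stages = [0] * depth

    def lock(self, reg):
        """Record a pending write to `reg` in the first empty stage."""
        for i, stage in enumerate(self.stages):
            if stage == 0:
                self.stages[i] = 1 << reg
                return
        raise RuntimeError("lock FIFO full: the instruction must wait")

    def is_locked(self, reg):
        """Column OR: `reg` is locked if any stage holds a '1' in its column."""
        return any(stage & (1 << reg) for stage in self.stages)

    def result_arrived(self):
        """A result returns: clear the oldest lock and report its register number."""
        oldest = self.stages[0]
        if oldest == 0:
            raise RuntimeError("no pending write")
        self.stages = self.stages[1:] + [0]
        return oldest.bit_length() - 1


def propagate(stages, src, dst):
    """Yield the intermediate states of one asynchronous move from src to dst."""
    stages = list(stages)
    stages[dst] = stages[src]  # 1. duplicate the entry into the next stage
    yield list(stages)
    stages[src] = 0            # 2. only then remove it from the first stage
    yield list(stages)


# State of Figure 14.4: results pending for r0, r2 and r12, fourth stage empty.
fifo = LockFifo()
for reg in (0, 2, 12):
    fifo.lock(reg)
```

A read of r12 would wait while fifo.is_locked(12) is true, and proceed once three results have arrived. In every state yielded by propagate the column 'OR' for the moving register stays at '1', which is the property the text relies on.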

AMULET1 depends entirely on the register locking mechanism to maintain register coherency, and as a result the execution pipeline is stalled quite frequently in typical code. (Register dependencies between consecutive instructions are common in typical code: standard ARM processors are insensitive to such dependencies, so the compiler makes no attempt to avoid them.)

AMULET1 performance

AMULET1 was developed to demonstrate the feasibility of designing a fully asynchronous implementation of a commercial microprocessor architecture. The prototype chips were functional and ran test programs generated using standard ARM development tools. The performance of the prototypes is summarized in Table 14.1, which shows the characteristics of the devices manufactured on two different process technologies by European Silicon Structures (ES2) and GEC Plessey Semiconductors (GPS). The layout of the 1 µm AMULET1 core is shown in Figure 14.5. The performance figures, based on the Dhrystone benchmark, show a performance which is of the same order as, but certainly no better than, an ARM6 processor built on the same process technology. However, AMULET1 was built primarily to demonstrate the feasibility of self-timed design, which it manifestly does.

Table 14.1 AMULET1 characteristics.

               AMULET1/ES2        AMULET1/GPS           ARM6
Process        1 µm               0.7 µm                1 µm
Area (mm²)     5.5 × 4.1          3.9 × 2.9             4.1 × 2.7
Transistors    58,374             58,374                33,494
Performance    20.5 kDhrystones   ~40 kDhrystones (a)   31 kDhrystones
Multiplier     5.3 ns/bit         3 ns/bit              25 ns/bit
Conditions     5 V, 20°C          5 V, 20°C             5 V, 20 MHz
Power          152 mW             N/A (b)               148 mW
MIPS/W         77                 N/A                   120

a. Estimated maximum performance.
b. The GPS part does not support power measurement.

Figure 14.5 AMULET1 die plot.

14.3 AMULET2

AMULET2 is the second-generation asynchronous ARM processor. It employs an organization which is very similar to that used in AMULET1, as illustrated in Figure 14.3. As described earlier, the two-phase (transition) signalling used on AMULET1 was abandoned in favour of four-phase (level) signalling. In addition, a number of organizational features were added to enhance performance.

AMULET2 register forwarding

AMULET2 employs the same register-locking mechanism as AMULET1, but in order to reduce the performance loss due to register dependency stalls, it also incorporates forwarding mechanisms to handle common cases. The bypass mechanisms used in clocked processor pipelines are inapplicable to asynchronous pipelines, so novel techniques are required. The two techniques used on AMULET2 are:

 

• A last result register. The instruction decoder keeps a record of the destination of the result from the execution pipeline, and if the immediately following instruction uses this register as a source operand the register read phase is bypassed and the value is collected from the last result register.

• A last loaded data register. The instruction decoder keeps a record of the destination of the last data item loaded from memory, and whenever this register is used as a source operand the register read phase is bypassed and the value is picked up directly from the last loaded data register. A mechanism similar to the lock FIFO serves as a guard on the register to ensure that the correct value is collected.

 

Both these mechanisms rely on the required result being available; where there is some uncertainty (for example when the result is produced by an instruction which is conditionally executed) the instruction decoder can fall back on the locking mechanism, exploiting the ability of the asynchronous organization to cope with variable delays in the supply of the operands.
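The decoder's choice of operand source can be sketched as follows. This is a simplified illustration and not the AMULET2 logic: the names Instr and ForwardingDecoder are hypothetical, and the sketch assumes that a conditionally executed instruction is never forwarded, and that the record of the last loaded register is dropped when a later instruction targets the same register. The text does not give these details.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Instr:
    dest: Optional[int]        # destination register number, or None
    sources: Tuple[int, ...]   # source register numbers
    is_load: bool = False
    conditional: bool = False


class ForwardingDecoder:
    """Tracks the two destination records kept by the instruction decoder."""

    def __init__(self):
        self.last_result_dest = None  # destination of the immediately preceding result
        self.last_loaded_dest = None  # destination of the last item loaded from memory

    def decode(self, instr):
        """Return the source chosen for each operand, then update the records."""
        choice = {}
        for reg in instr.sources:
            if reg == self.last_result_dest:
                choice[reg] = "last result register"
            elif reg == self.last_loaded_dest:
                choice[reg] = "last loaded data register"
            else:
                choice[reg] = "register file (subject to locking)"

        # The last result record only covers the immediately following instruction.
        self.last_result_dest = None
        if instr.dest is not None:
            if instr.dest == self.last_loaded_dest:
                self.last_loaded_dest = None  # assumption: the record is now stale
            if instr.conditional:
                pass  # uncertain result: fall back on the locking mechanism
            elif instr.is_load:
                self.last_loaded_dest = instr.dest
            else:
                self.last_result_dest = instr.dest
        return choice


decoder = ForwardingDecoder()
decoder.decode(Instr(dest=1, sources=(2, 3)))           # e.g. ADD r1, r2, r3
first = decoder.decode(Instr(dest=4, sources=(1, 5)))   # uses r1 immediately
second = decoder.decode(Instr(dest=6, sources=(1,)))    # r1 is no longer the last result
```

In this model the operand r1 of the second instruction comes from the last result register, while the same operand one instruction later is read from the register file in the normal way, where the lock FIFO still guarantees coherency.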

AMULET2 jump trace buffer

AMULET1 prefetches instructions sequentially from the current PC value and all deviations from sequential execution must be issued as corrections from the execution pipeline to the address interface. Every time the PC has to be corrected performance is lost and energy is wasted in prefetching instructions that are then discarded.

 

AMULET2 attempts to reduce this inefficiency by remembering where branches were previously taken and guessing that control will subsequently follow the same path. The organization of the jump trace buffer is shown in Figure 14.6; it is similar to that used on the MU5 mainframe computer developed at the University of Manchester between 1969 and 1974 (which also operated with asynchronous control).

Figure 14.6 The AMULET2 jump trace buffer.

The buffer caches the program counters and targets of recently taken branch instructions, and whenever it spots an instruction fetch from an address that it has stored it modifies the predicted control flow from sequential to the previous branch target. If this prediction turns out to be correct, exactly the right instruction sequence is fetched from memory; if it turns out to be wrong and the branch should not have been taken, the branch is executed as an 'unbranch' instruction to return to the previous sequential flow.
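The behaviour of the jump trace buffer can be modelled in a few lines. The sketch below is an illustration only: the class name JumpTraceBuffer and its methods are hypothetical, the replacement policy (discard the oldest entry) is an assumption, and instructions are taken to be 4 bytes long as in ARM code.

```python
from collections import OrderedDict


class JumpTraceBuffer:
    """Toy model: maps the address of a taken branch to its target."""

    def __init__(self, entries=20):
        self.entries = entries
        self.table = OrderedDict()  # branch address -> branch target

    def predict_next(self, fetch_addr):
        """Next fetch address: the stored target on a hit, otherwise sequential."""
        return self.table.get(fetch_addr, fetch_addr + 4)

    def resolve(self, branch_addr, taken, target):
        """Called when the branch executes.

        Returns None if the prefetch stream was already correct, otherwise the
        address the PC must be corrected to (for a wrongly predicted taken
        branch this is the 'unbranch' back to the sequential flow).
        """
        predicted = self.predict_next(branch_addr)
        actual = target if taken else branch_addr + 4
        if taken:
            self.table[branch_addr] = target
            self.table.move_to_end(branch_addr)
            if len(self.table) > self.entries:
                self.table.popitem(last=False)  # assumption: evict the oldest entry
        return None if predicted == actual else actual


jtb = JumpTraceBuffer(entries=20)
first_time = jtb.resolve(0x100, taken=True, target=0x200)   # not yet in the buffer
second_time = jtb.resolve(0x100, taken=True, target=0x200)  # predicted correctly
not_taken = jtb.resolve(0x100, taken=False, target=0x200)   # 'unbranch' case
```

The first execution of the branch needs a correction to 0x200, the second needs none, and when the branch later falls through the correction returns control to 0x104.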

Branch statistics

The effectiveness of the jump trace buffer depends on the statistics of typical branch behaviour. Typical figures are shown in Table 14.2.

 

Table 14.2 AMULET2 branch prediction statistics.

Prediction algorithm   Correct   Incorrect   Redundant fetches
Sequential             33%       67%         2 per branch (ave.)
Trace buffer           67%       33%         1 per branch (ave.)

In the absence of the jump trace buffer, the default sequential fetch pattern is equivalent to predicting that all branches are not taken. This is correct for one-third of all branches, and incorrect for the remaining two-thirds. A jump trace buffer with around 20 entries reverses these proportions, correctly predicting around two-thirds of all branches and mispredicting or failing to predict around one-third.

Although the depth of prefetching beyond a branch is non-deterministic, a prefetch depth of around three instructions is observed on AMULET2. The fetched instructions are used when the branch is predicted correctly, but are discarded when the branch is mispredicted or not predicted. The jump trace buffer therefore reduces the average number of redundant fetches per branch from two to one. Since branches occur around once every five instructions in typical code, the jump trace buffer may be expected to reduce the instruction fetch bandwidth by around 20% and the total memory bandwidth by 10% to 15%.
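The figures quoted above can be reproduced with a short calculation. The sketch below takes the prefetch depth, the misprediction rates and the branch frequency from the text; the variable names are illustrative, and the comparison against total fetch traffic is an additional reading of the numbers rather than a figure from the source.

```python
from fractions import Fraction

PREFETCH_DEPTH = 3   # instructions fetched beyond a branch (approximate)
BRANCH_INTERVAL = 5  # one branch in every five instructions (approximate)


def redundant_fetches_per_branch(mispredict_rate):
    """Average number of discarded fetches for each branch."""
    return mispredict_rate * PREFETCH_DEPTH


sequential = redundant_fetches_per_branch(Fraction(2, 3))    # 2 per branch
trace_buffer = redundant_fetches_per_branch(Fraction(1, 3))  # 1 per branch

# One fetch saved per branch, expressed per executed instruction.
saving_vs_useful = (sequential - trace_buffer) / BRANCH_INTERVAL

# The same saving expressed against all fetches made without the buffer
# (five useful fetches plus two redundant ones for every branch).
saving_vs_total = (sequential - trace_buffer) / (BRANCH_INTERVAL + sequential)

print(float(saving_vs_useful), float(saving_vs_total))
```

On these assumptions the saving is one fetch in every five executed instructions, which matches the 'around 20%' quoted; measured against the total fetch traffic of the sequential scheme it is one in seven, or about 14%.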

Where the system performance is limited by the available memory bandwidth, this saving translates directly into improved performance; in any case it represents a power saving due to the elimination of redundant activity.

'Halt'

Unlike some other microprocessors, the ARM does not have an explicit 'Halt' instruction. Instead, when a program can find no more useful work to do it usually enters an idle loop, executing:

        B       .               ; loop until interrupted

Here the '.' denotes the current PC, so the branch target is the branch instruction itself, and the program sits in this single instruction loop until an interrupt causes it to do something else.

Clearly the processor is doing no useful work while in this idle loop, so any power it uses is wasted. AMULET2 detects the opcode corresponding to a branch which loops to itself and uses this to stall a signal at one point in the asynchronous control network. This stall rapidly propagates throughout the control network, bringing the processor to an inactive, zero power state. An active interrupt request releases the stall, enabling the processor to resume normal throughput immediately.

AMULET2 can therefore switch between zero power and maximum throughput states at a very high rate and with no software overhead; indeed, much existing ARM code will give optimum power-efficiency using this scheme even though it was not written with the scheme in mind. This makes the processor well suited to low-power applications with bursty real-time load characteristics.
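Recognizing the idle loop amounts to matching a single instruction encoding. In the ARM instruction set a branch holds a signed 24-bit word offset relative to the address of the branch plus 8, so 'B .' has an offset of -2 words and, when unconditional, assembles to 0xEAFFFFFE. The Python sketch below tests for that pattern; the function name is hypothetical, and whether AMULET2 also recognizes conditional forms of the loop is not stated in the text, so the sketch accepts only the unconditional form.

```python
def is_branch_to_self(instr):
    """Return True if the 32-bit ARM instruction word is an unconditional 'B .'."""
    cond = (instr >> 28) & 0xF       # condition field, 0xE means 'always'
    opcode = (instr >> 24) & 0xF     # 0b1010 is B, 0b1011 is BL
    offset = instr & 0x00FFFFFF      # signed 24-bit word offset
    return cond == 0xE and opcode == 0b1010 and offset == 0xFFFFFE


idle = is_branch_to_self(0xEAFFFFFE)      # B .
with_link = is_branch_to_self(0xEBFFFFFE) # BL . (not the idle loop)
```

In the processor the result of this comparison would gate a handshake in the control network rather than return a value; the model shows only the decoding step.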

14.4 AMULET2e

AMULET2e is an AMULET2 processor core (see Section 14.3) combined with 4 Kbytes of memory, which can be configured either as a cache or a fixed RAM area, and a flexible memory interface (the funnel) which allows 8-, 16- or 32-bit external devices to be connected directly, including memories built from DRAM. The internal organization of AMULET2e is illustrated in Figure 14.7.

Figure 14.7 AMULET2e internal organization.

AMULET2e cache

The cache comprises four 1 Kbyte blocks, each of which is a fully associative random replacement store with a quad-word line and block size. A pipeline register between the CAM and the RAM sections allows a following access to begin its CAM look-up while the previous access completes within the RAM; this exploits the ability of the AMULET2 core to issue multiple memory requests before the data is returned from the first. Sequential accesses are detected and bypass the CAM look-up, thereby saving power and improving performance.

 

Cache line fetches are non-blocking, accessing the addressed item first and then allowing the processor to continue while the rest of the line is fetched. The line fetch automaton continues loading the line fetch buffer while the processor accesses the cache. There is an additional CAM entry that identifies references to the data which is stored in the line fetch buffer. Indeed, this data remains in the line fetch buffer where it can be accessed on equal terms to data in the cache until the next cache miss, whereupon the whole buffer is copied into the cache while the new data is loaded from external memory into the line fetch buffer.
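The role of the line fetch buffer can be illustrated with a small behavioural model. The sketch below is not the AMULET2e design: the class name and the dictionary used as external memory are hypothetical, it models a single fully associative block (64 lines of four words would correspond to one 1 Kbyte block), and it ignores timing, so the non-blocking nature of the fetch is not represented. It shows only that the buffer is searched alongside the cache and is copied into the cache, with random replacement, on the next miss.

```python
import random

LINE_WORDS = 4  # quad-word line


class CacheWithLineFetchBuffer:
    """Behavioural sketch of one cache block plus the line fetch buffer."""

    def __init__(self, memory, num_lines=64, rng=None):
        self.memory = memory      # word address -> value (stand-in for external memory)
        self.num_lines = num_lines
        self.lines = {}           # line tag -> list of four words
        self.buffer_tag = None    # the additional CAM entry
        self.buffer = None        # contents of the line fetch buffer
        self.rng = rng or random.Random(0)

    def read(self, word_addr):
        tag, offset = divmod(word_addr, LINE_WORDS)
        if tag == self.buffer_tag:
            return self.buffer[offset], "buffer hit"
        if tag in self.lines:
            return self.lines[tag][offset], "cache hit"
        # Miss: the previous buffer contents move into the cache...
        if self.buffer_tag is not None:
            if len(self.lines) >= self.num_lines:
                victim = self.rng.choice(list(self.lines))  # random replacement
                del self.lines[victim]
            self.lines[self.buffer_tag] = self.buffer
        # ...while the new line is loaded into the line fetch buffer.
        base = tag * LINE_WORDS
        self.buffer_tag = tag
        self.buffer = [self.memory.get(base + i, 0) for i in range(LINE_WORDS)]
        return self.buffer[offset], "miss"


memory = {addr: addr * 10 for addr in range(32)}
cache = CacheWithLineFetchBuffer(memory, num_lines=2)
trace = [cache.read(addr) for addr in (5, 6, 9, 4, 8)]
```

In the trace, the access to word 6 hits in the buffer, the miss on word 9 moves the first line into the cache, and word 4 is then found there while word 8 is served from the buffer.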

AMULET2e silicon

A plot of the AMULET2e die is shown in Figure 14.8 and the characteristics of the device are summarized in Table 14.3. The AMULET2 core uses 93,000 transistors, the cache 328,000 and the remainder are in the control logic and pads.

Figure 14.8 AMULET2e die plot.

Table 14.3 AMULET2e characteristics.

Process        0.5 µm    Transistors   454,000    MIPS      40
Metal layers   3         Die area      41 mm²     Power     140 mW
Vdd            3.3 V     Clock         none       MIPS/W    285

Timing reference

The absence of a reference clock in an asynchronous system makes timing memory accesses an issue that requires careful consideration. The solution incorporated into AMULET2e uses a single external reference delay connected directly to the chip, together with configuration registers, loaded at start-up, which specify the organization and timing properties of each memory region. The reference delay could, for example, reflect the external SRAM access time, so the RAM will be configured to take one reference delay. The slower ROM may be configured to take several reference delays. (Note that the reference delay is only used for off-chip timing; all on-chip delays are self-timed.)

AMULET2e systems

AMULET2e has been configured to make building small systems as straightforward as possible. As an example, Figure 14.9 shows a test card incorporating AMULET2e. The only components, apart from AMULET2e itself, are four SRAM chips, one ROM chip, a UART and an RS232 line interface. The UART uses a crystal oscillator to control its bit rate, but all the system timing functions are controlled by AMULET2e.

 

The ROM contains the standard ARM 'Angel' code and the host computer at the other end of the RS232 serial line runs the ARM development tools. This system demonstrates that using an asynchronous processor need be no more difficult than using a clocked processor provided that the memory interface has been carefully thought out.

Figure 14.9 AMULET2e test card organization.


14.5 AMULET3

AMULET3 is being developed to establish the commercial viability of asynchronous design. Like its predecessors, AMULET3 is a full-functionality ARM-compatible microprocessor with support for interrupts and memory faults. AMULET1 and AMULET2 implemented the ARM6 architecture (ARM architecture version 3G). AMULET3 supports ARM architecture version 4T, including the 16-bit Thumb instruction set.

Performance objective

The objective of the AMULET3 project was to produce an asynchronous implementation of ARM architecture v4T which is competitive in terms of power-efficiency and performance with the ARM9TDMI. This implies a performance target of over 100 MIPS (measured using Dhrystone 2.1) on a 0.35 µm process, compared to the 40 MIPS delivered by AMULET2e on a 0.5 µm process.

 

Increasing the performance by more than a factor of two requires a radical change to the core organization. As with a clocked processor, the basic approach is based on a combination of increasing the cycle rate of the processor pipeline and decreasing the average number of cycles per instruction. Here, however, the 'cycles' are not defined by an external clock and are not of fixed duration.

AMULET3 core organization

The organization of AMULET3 is illustrated in Figure 14.10. The six principal pipeline stages are as follows:

• the instruction prefetch unit, which includes a branch target buffer;
• the instruction decode, register read and forwarding stage;
• the execute stage, which includes the shifter, multiplier and ALU;
• the data memory interface;
• the reorder buffer;
• the register result write-back stage.

 

The core employs a Harvard architecture (separate instruction and data memory ports) which provides the higher memory bandwidth required to support higher performance. Only those instructions that require data memory access (loads and stores) pass through the data memory, so the pipeline depth is greater for these instructions than, for example, for simple data processing instructions.

 

Further details on each of the pipeline stages are given below.

Figure 14.10 AMULET3 core organization.

Instruction prefetch unit

The prefetch unit operates autonomously, fetching 32-bit (one ARM or two Thumb) instruction packets from memory whenever it has room to store them.

It incorporates a branch prediction unit that is similar to the jump trace buffer used in AMULET2, but has extensions to support branch prediction in Thumb code. A two-instruction Thumb packet may contain zero, one or two branch instructions, and the branch predictor must handle all of these cases. When executing Thumb code the 16-entry branch prediction unit works in two 8-entry halves, with branches at even half-word addresses being stored in one half and branches at odd half-word addresses in the other. A packet with a branch in each half word may 'hit' in both halves, whereupon the target of the first instruction (at the even address) takes priority.
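The two-half arrangement used for Thumb code can be sketched as follows. The model is an illustration under stated assumptions: the class and method names are hypothetical, the replacement policy within each half (discard the oldest entry) is assumed, and the case where execution enters a packet at its second instruction is ignored.

```python
class ThumbBranchPredictor:
    """Toy model of the 16-entry predictor working as two 8-entry halves."""

    HALF_ENTRIES = 8

    def __init__(self):
        # halves[0] holds branches at even half-word addresses, halves[1] odd ones.
        self.halves = [{}, {}]

    @staticmethod
    def _half_index(byte_addr):
        """0 for an even half-word address, 1 for an odd one."""
        return (byte_addr >> 1) & 1

    def record(self, branch_addr, target):
        """Store a taken branch in the half selected by its half-word address."""
        half = self.halves[self._half_index(branch_addr)]
        if branch_addr not in half and len(half) >= self.HALF_ENTRIES:
            half.pop(next(iter(half)))  # assumption: evict the oldest entry
        half[branch_addr] = target

    def predict(self, packet_addr):
        """Look up both instructions of the word-aligned packet at packet_addr.

        Returns the predicted target, or None for sequential fetch. If both
        halves hit, the first instruction (even address) takes priority.
        """
        even_hit = self.halves[0].get(packet_addr)
        odd_hit = self.halves[1].get(packet_addr + 2)
        return even_hit if even_hit is not None else odd_hit


predictor = ThumbBranchPredictor()
predictor.record(0x102, 0x300)           # branch in the second half word
odd_only = predictor.predict(0x100)
predictor.record(0x100, 0x200)           # branch in the first half word as well
both_hit = predictor.predict(0x100)
```

With only the second instruction recorded the packet at 0x100 predicts 0x300; once the first instruction is also recorded, both halves hit and the even-address target 0x200 is chosen.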

AMULET3 includes a zero-power 'Halt' instruction, as did AMULET2. Here the halt takes effect in the prefetch unit rather than the execution unit, giving a faster resumption of full throughput when an interrupt occurs. Interrupts themselves are also handled here, again giving reduced latency.

Decode and register read

The decode stage includes Thumb and ARM decode logic, and the register read and forwarding mechanisms.

The ARM7TDMI implemented the Thumb decode function by first converting Thumb instructions into their ARM equivalents, and then performing ARM decode. ARM9TDMI reduced decode latency by decoding Thumb instructions directly. AMULET3 adopts a middle way, decoding certain time-critical control signals directly and others via the ARM instruction decoder. The asynchronous pipeline