
Furber S. ARM system-on-chip architecture. 2000
The AMULET Asynchronous ARM Processors
If a subsequent instruction requests r12 as a source operand, an inspection of the FIFO column corresponding to r12 (outlined in the figure) reveals whether or not r12 is valid. A '1' anywhere in the column signifies a pending write to r12, so its current value is obsolete. The read waits until the '1' is cleared, then it can proceed.
The 'inspection' is implemented in hardware by a logical 'OR' function across the column for each register. This may appear hazardous since the data in the FIFO may move down the FIFO while the 'OR' output is being used. However, data moves in an asynchronous FIFO by being duplicated from one stage into the next; only then is it removed from the first stage. A propagating '1' will therefore appear alternately in one or two positions and will never disappear completely, so the 'OR' output is stable even though the data is moving.
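The stability argument can be checked with a small behavioural model. The following is a minimal sketch, not the AMULET1 circuit: the FIFO depth of three, the function names and the micro-step ordering are assumptions chosen for illustration. It pushes one pending write to r12 through the FIFO using the duplicate-then-clear movement described above and counts the '1's in the r12 column after every micro-step.

```python
NUM_REGS = 16  # one FIFO column per ARM register, r0..r15


def ones_in_column(fifo, reg):
    """Number of FIFO stages currently holding a '1' for `reg`."""
    return sum(stage[reg] for stage in fifo)


def column_or(fifo, reg):
    """The 'OR' across the column: True while a write to `reg` is pending."""
    return ones_in_column(fifo, reg) > 0


def trace_lock(depth=3, reg=12):
    """Push one pending write to `reg` through the FIFO and record the number
    of '1's in its column after every micro-step."""
    fifo = [[0] * NUM_REGS for _ in range(depth)]
    fifo[0][reg] = 1                      # the destination enters the FIFO
    counts = [ones_in_column(fifo, reg)]
    for src in range(depth - 1):
        fifo[src + 1] = list(fifo[src])   # duplicate into the next stage...
        counts.append(ones_in_column(fifo, reg))
        fifo[src] = [0] * NUM_REGS        # ...and only then clear the first
        counts.append(ones_in_column(fifo, reg))
    fifo[depth - 1] = [0] * NUM_REGS      # the write completes; lock cleared
    counts.append(ones_in_column(fifo, reg))
    return counts


counts = trace_lock()
print(counts)                             # [1, 2, 1, 2, 1, 0]
```

The count alternates between one and two and reaches zero only when the write completes, so an 'OR' over the column stays true for the whole time the entry is in flight.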
AMULET1 depends entirely on the register locking mechanism to maintain register coherency, and as a result the execution pipeline is stalled quite frequently in typical code. (Register dependencies between consecutive instructions are common in typical code since the compiler makes no attempt to avoid them because standard ARM processors are insensitive to such dependencies.)
AMULET1 performance
AMULET1 was developed to demonstrate the feasibility of designing a fully asynchronous implementation of a commercial microprocessor architecture. The prototype chips were functional and ran test programs generated using standard ARM development tools. The performance of the prototypes is summarized in Table 14.1, which shows the characteristics of the devices manufactured on two different process technologies by European Silicon Systems (ES2) and GEC Plessey Semiconductors (GPS). The layout of the 1 µm AMULET1 core is shown in Figure 14.5. The performance figures, based on the Dhrystone benchmark, show a performance which is of the same order as, but certainly no better than, an ARM6 processor built on the same process technology. However, AMULET1 was built primarily to demonstrate the feasibility of self-timed design, which it manifestly does.

Table 14.1 AMULET1 characteristics.

|             | AMULET1/ES2      | AMULET1/GPS          | ARM6           |
|-------------|------------------|----------------------|----------------|
| Process     | 1 µm             | 0.7 µm               | 1 µm           |
| Area (mm²)  | 5.5 × 4.1        | 3.9 × 2.9            | 4.1 × 2.7      |
| Transistors | 58,374           | 58,374               | 33,494         |
| Performance | 20.5 kDhrystones | ~40 kDhrystones (a)  | 31 kDhrystones |
| Multiplier  | 5.3 ns/bit       | 3 ns/bit             | 25 ns/bit      |
| Conditions  | 5 V, 20°C        | 5 V, 20°C            | 5 V, 20 MHz    |
| Power       | 152 mW           | N/A (b)              | 148 mW         |
| MIPS/W      | 77               | N/A                  | 120            |

a. Estimated maximum performance.
b. The GPS part does not support power measurement.

Figure 14.5 AMULET1 die plot.
14.3 AMULET2
AMULET2 is the second-generation asynchronous ARM processor. It employs an organization which is very similar to that used in AMULET1, as illustrated in Figure 14.3 on page 378. As described earlier, the two-phase (transition) signalling used on AMULET1 was abandoned in favour of four-phase (level) signalling. In addition, a number of organizational features were added to enhance performance.
AMULET2 register forwarding
AMULET2 employs the same register-locking mechanism as AMULET1, but in order to reduce the performance loss due to register dependency stalls, it also incorporates forwarding mechanisms to handle common cases. The bypass mechanisms used in clocked processor pipelines are inapplicable to asynchronous pipelines, so novel techniques are required. The two techniques used on AMULET2 are:
• A last result register. The instruction decoder keeps a record of the destination of the result from the execution pipeline, and if the immediately following instruction uses this register as a source operand the register read phase is bypassed and the value is collected from the last result register.
• A last loaded data register. The instruction decoder keeps a record of the destination of the last data item loaded from memory, and whenever this register is used as a source operand the register read phase is bypassed and the value is picked up directly from the last loaded data register. A mechanism similar to the lock FIFO serves as a guard on the register to ensure that the correct value is collected.
Both these mechanisms rely on the required result being available; where there is some uncertainty (for example when the result is produced by an instruction which is conditionally executed) the instruction decoder can fall back on the locking mechanism, exploiting the ability of the asynchronous organization to cope with variable delays in the supply of the operands.
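The decoder's choice of operand source can be sketched as a small bookkeeping model. This is an illustration under stated assumptions, not the AMULET2 decoder: the class and function names are hypothetical, and the conservative handling of conditional instructions (simply forgetting the recorded destination) is one possible reading of "fall back on the locking mechanism".

```python
from dataclasses import dataclass
from typing import Optional

LAST_RESULT = "last result register"
LAST_LOADED = "last loaded data register"
REGISTER_FILE = "register file (guarded by the lock FIFO)"


@dataclass
class DecoderRecord:
    # Destination of the immediately preceding execution-pipeline result.
    last_result_dest: Optional[int] = None
    # Destination of the last data item loaded from memory.
    last_load_dest: Optional[int] = None


def note_instruction(rec, dest, is_load=False, conditional=False):
    """Update the decoder's records after decoding one instruction."""
    # Assumption: a conditional instruction leaves the outcome uncertain,
    # so no forwarding record is kept for its destination.
    certain = None if conditional else dest
    if is_load:
        rec.last_load_dest = certain
        rec.last_result_dest = None       # the previous result is no longer "last"
    else:
        rec.last_result_dest = certain
        if dest == rec.last_load_dest:    # loaded value may now be obsolete
            rec.last_load_dest = None


def operand_source(rec, reg):
    """Where the next instruction should collect source operand `reg`."""
    if reg == rec.last_result_dest:
        return LAST_RESULT
    if reg == rec.last_load_dest:
        return LAST_LOADED
    return REGISTER_FILE


rec = DecoderRecord()
note_instruction(rec, dest=1)                  # e.g. ADD r1, ...
print(operand_source(rec, 1))                  # last result register
note_instruction(rec, dest=3, is_load=True)    # e.g. LDR r3, ...
print(operand_source(rec, 3))                  # last loaded data register
```

The last result record is overwritten by every instruction, which matches the "immediately following instruction" restriction, whereas the last loaded record persists until another load or a write to the same register replaces it.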
AMULET2 jump trace buffer
AMULET1 prefetches instructions sequentially from the current PC value and all deviations from sequential execution must be issued as corrections from the execution pipeline to the address interface. Every time the PC has to be corrected performance is lost and energy is wasted in prefetching instructions that are then discarded.
AMULET2 attempts to reduce this inefficiency by remembering where branches were previously taken and guessing that control will subsequently follow the same path. The organization of the jump trace buffer is shown in Figure 14.6; it is similar to that used on the MU5 mainframe computer developed at the University of Manchester between 1969 and 1974 (which also operated with asynchronous control).
Figure 14.6 The AMULET2 jump trace buffer.
The buffer caches the program counters and targets of recently taken branch instructions, and whenever it spots an instruction fetch from an address that it has stored it modifies the predicted control flow from sequential to the previous branch target. If this prediction turns out to be correct, exactly the right instruction sequence is fetched from memory; if it turns out to be wrong and the branch should not have been taken, the branch is executed as an 'unbranch' instruction to return to the previous sequential flow.
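The behaviour described above can be sketched as a small lookup table. This is a minimal model, not the AMULET2 hardware: the class and method names are hypothetical, the oldest-first replacement policy is an assumption, and the default of 20 entries is taken from the figure quoted with the branch statistics in this section.

```python
from collections import OrderedDict


class JumpTraceBuffer:
    """Remembers (branch PC -> target) for recently taken branches."""

    def __init__(self, entries=20):
        self.entries = entries
        self.table = OrderedDict()

    def predict_next(self, pc):
        """Next fetch address: the stored target if `pc` is a remembered
        taken branch, otherwise the sequential address."""
        return self.table.get(pc, pc + 4)

    def resolve(self, pc, taken, target):
        """Called when the branch at `pc` executes. Returns the address the
        execution pipeline must send as a correction, or None if the
        prefetched stream was already right."""
        predicted = self.predict_next(pc)
        actual = target if taken else pc + 4
        if taken:
            self.table[pc] = target
            self.table.move_to_end(pc)
            if len(self.table) > self.entries:
                self.table.popitem(last=False)   # assumed: evict the oldest
        return None if predicted == actual else actual


jtb = JumpTraceBuffer()
print(jtb.resolve(0x100, taken=True, target=0x80))    # 128: first time, corrected
print(jtb.resolve(0x100, taken=True, target=0x80))    # None: predicted correctly
print(jtb.resolve(0x100, taken=False, target=0x80))   # 260: 'unbranch' to pc + 4
```

The last call corresponds to the 'unbranch' case: the buffer predicted the branch as taken, so the correction returns the fetch stream to the sequential address.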
Branch statistics
The effectiveness of the jump trace buffer depends on the statistics of typical branch behaviour. Typical figures are shown in Table 14.2.

Table 14.2 AMULET2 branch prediction statistics.

| Prediction algorithm | Correct | Incorrect | Redundant fetches    |
|----------------------|---------|-----------|----------------------|
| Sequential           | 33%     | 67%       | 2 per branch (ave.)  |
| Trace buffer         | 67%     | 33%       | 1 per branch (ave.)  |

In the absence of the jump trace buffer, the default sequential fetch pattern is equivalent to predicting that all branches are not taken. This is correct for one-third of all branches, and incorrect for the remaining two-thirds. A jump trace buffer with around 20 entries reverses these proportions, correctly predicting around two-thirds of all branches and mispredicting or failing to predict around one-third.
Although the depth of prefetching beyond a branch is non-deterministic, a prefetch depth of around three instructions is observed on AMULET2. The fetched instructions are used when the branch is predicted correctly, but are discarded when the branch is mispredicted or not predicted. The jump trace buffer therefore reduces the average number of redundant fetches per branch from two to one. Since branches occur around once every five instructions in typical code, the jump trace buffer may be expected to reduce the instruction fetch bandwidth by around 20% and the total memory bandwidth by 10% to 15%.
Where the system performance is limited by the available memory bandwidth, this saving translates directly into improved performance; in any case it represents a power saving due to the elimination of redundant activity.
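The figures in this argument follow from simple arithmetic, which the sketch below reproduces. It assumes that every mispredicted (or unpredicted) branch wastes the full prefetch depth of three instructions, and that the 20% saving is measured relative to the number of instructions actually executed; both are interpretations of the text rather than stated facts, and the constant and function names are illustrative.

```python
from fractions import Fraction

PREFETCH_DEPTH = 3                   # instructions fetched beyond a branch
BRANCH_FREQUENCY = Fraction(1, 5)    # one branch every five instructions


def redundant_fetches_per_branch(mispredict_rate):
    """Average number of discarded fetches per branch."""
    return mispredict_rate * PREFETCH_DEPTH


sequential = redundant_fetches_per_branch(Fraction(2, 3))     # no buffer
trace_buffer = redundant_fetches_per_branch(Fraction(1, 3))   # with buffer

# Fetches saved per executed instruction by adding the jump trace buffer.
saving = (sequential - trace_buffer) * BRANCH_FREQUENCY

print(sequential, trace_buffer, saving)   # 2 1 1/5
```

Under these assumptions the buffer saves one fetch for every five instructions executed, which is the source of the figure of around 20%.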
'Halt'
Unlike some other microprocessors, the ARM does not have an explicit 'Halt' instruction. Instead, when a program can find no more useful work to do it usually enters an idle loop, executing:

        B       .               ; loop until interrupted

Here the '.' denotes the current PC, so the branch target is the branch instruction itself, and the program sits in this single instruction loop until an interrupt causes it to do something else.
Clearly the processor is doing no useful work while in this idle loop, so any power it uses is wasted. AMULET2 detects the opcode corresponding to a branch which loops to itself and uses this to stall a signal at one point in the asynchronous control network. This stall rapidly propagates throughout the control, bringing the processor to an inactive, zero power state. An active interrupt request releases the stall, enabling the processor to resume normal throughput immediately.
AMULET2 can therefore switch between zero power and maximum throughput states at a very high rate and with no software overhead; indeed, much existing ARM code will give optimum power-efficiency using this scheme even though it was not written with the scheme in mind. This makes the processor very applicable to low-power applications with bursty real-time load characteristics.
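Recognizing the idle loop amounts to a pattern match on the instruction word. The sketch below relies on the standard 32-bit ARM branch encoding (condition in bits 31-28, 0b1010 in bits 27-24 for B, a signed 24-bit word offset in bits 23-0, target = PC + 8 + 4 × offset); restricting the match to the unconditional branch without link is an assumption, since the text does not say exactly which encodings AMULET2 detects, and the function name is hypothetical.

```python
COND_ALWAYS = 0xE


def is_idle_loop(instruction):
    """True if the 32-bit ARM instruction word is an unconditional 'B .'."""
    cond = (instruction >> 28) & 0xF
    is_branch = ((instruction >> 24) & 0xF) == 0b1010    # B, not BL
    offset = instruction & 0x00FFFFFF
    if offset & 0x00800000:                              # sign-extend 24 bits
        offset -= 0x01000000
    # Target = PC + 8 + 4 * offset, so a branch to itself has offset -2.
    return cond == COND_ALWAYS and is_branch and offset == -2


print(hex(0xEAFFFFFE), is_idle_loop(0xEAFFFFFE))   # 0xeafffffe True
print(hex(0xEA000000), is_idle_loop(0xEA000000))   # 0xea000000 False
```

Because the test uses only the instruction word, no knowledge of the current PC is needed, which is consistent with the description of detecting an opcode.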
14.4 AMULET2e
AMULET2e is an AMULET2 processor core (see Section 14.3 on page 381) combined with 4 Kbytes of memory, which can be configured either as a cache or a fixed RAM area, and a flexible memory interface (the funnel) which allows 8-, 16- or 32-bit external devices to be connected directly, including memories built from DRAM. The internal organization of AMULET2e is illustrated in Figure 14.7.
AMULET2e cache
The cache comprises four 1 Kbyte blocks, each of which is a fully associative random replacement store with a quad-word line and block size. A pipeline register between the CAM and the RAM sections allows a following access to begin its CAM look-up while the previous access completes within the RAM; this exploits the ability of the AMULET2 core to issue multiple memory requests before the data is returned from the first. Sequential accesses are detected and bypass the CAM look-up, thereby saving power and improving performance.
Figure 14.7 AMULET2e internal organization.
Cache line fetches are non-blocking, accessing the addressed item first and then allowing the processor to continue while the rest of the line is fetched. The line fetch automaton continues loading the line fetch buffer while the processor accesses the cache. There is an additional CAM entry that identifies references to the data which is stored in the line fetch buffer. Indeed, this data remains in the line fetch buffer where it can be accessed on equal terms to data in the cache until the next cache miss, whereupon the whole buffer is copied into the cache while the new data is loaded from external memory into the line fetch buffer.
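The role of the line fetch buffer can be illustrated with a purely functional model. This is a sketch under simplifying assumptions, not the AMULET2e design: timing, the non-blocking fetch order, the 4 Kbyte capacity and random replacement are all left out, memory is word-addressed, and the class name and status strings are hypothetical.

```python
LINE_WORDS = 4   # quad-word line


class LineFetchCache:
    def __init__(self, memory):
        self.memory = memory        # list of words standing in for external memory
        self.cache = {}             # line number -> list of LINE_WORDS words
        self.buffer_line = None     # line currently held in the line fetch buffer
        self.buffer_data = None

    def read(self, word_addr):
        """Return (value, where it was found)."""
        line, offset = divmod(word_addr, LINE_WORDS)
        if line in self.cache:
            return self.cache[line][offset], "cache hit"
        if line == self.buffer_line:            # the additional CAM entry
            return self.buffer_data[offset], "line fetch buffer hit"
        # Miss: the old buffer contents move into the cache while the
        # new line is loaded into the buffer.
        if self.buffer_line is not None:
            self.cache[self.buffer_line] = self.buffer_data
        base = line * LINE_WORDS
        self.buffer_line = line
        self.buffer_data = self.memory[base:base + LINE_WORDS]
        return self.buffer_data[offset], "miss"


c = LineFetchCache(list(range(100, 132)))
print(c.read(5))   # (105, 'miss')
print(c.read(6))   # (106, 'line fetch buffer hit')
print(c.read(9))   # (109, 'miss')
print(c.read(4))   # (104, 'cache hit')
```

The second access shows data being served from the buffer on equal terms with the cache, and the fourth shows that the line only reaches the cache proper once the following miss has displaced it from the buffer.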
AMULET2e silicon
A plot of the AMULET2e die is shown in Figure 14.8 and the characteristics of the device are summarized in Table 14.3 on page 386. The AMULET2 core uses 93,000 transistors, the cache 328,000 and the remainder are in the control logic and pads.
Figure 14.8 AMULET2e die plot.
Table 14.3 AMULET2e characteristics.

| Process      | 0.5 µm | Transistors | 454,000 | MIPS   | 40     |
| Metal layers | 3      | Die area    | 41 mm²  | Power  | 140 mW |
| Vdd          | 3.3 V  | Clock       | none    | MIPS/W | 285    |

Timing reference
The absence of a reference clock in an asynchronous system makes timing memory accesses an issue that requires careful consideration. The solution incorporated into AMULET2e uses a single external reference delay connected directly to the chip and configuration registers, loaded at start-up, which specify the organization and timing properties of each memory region. The reference delay could, for example, reflect the external SRAM access time, so the RAM will be configured to take one reference delay. The slower ROM may be configured to take several reference delays. (Note that the reference delay is only used for off-chip timing; all on-chip delays are self-timed.)
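The idea of expressing each region's timing as a multiple of one reference delay can be sketched as follows. Everything specific here is hypothetical: the field names, the 25 ns reference delay, the three-delay ROM setting and the 8-bit ROM width are illustrative values, not AMULET2e register formats or measured figures, and the assumption that a narrow device needs several accesses per 32-bit word is an inference from the description of the funnel.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegionConfig:
    name: str
    width_bits: int         # external device width: 8, 16 or 32
    reference_delays: int   # access time in units of the reference delay


def word_access_time_ns(region, reference_delay_ns):
    """Time to transfer one 32-bit word to or from the region."""
    accesses = 32 // region.width_bits          # assumed funnel behaviour
    return accesses * region.reference_delays * reference_delay_ns


REFERENCE_DELAY_NS = 25.0                       # hypothetical SRAM access time

sram = RegionConfig("SRAM", width_bits=32, reference_delays=1)
rom = RegionConfig("ROM", width_bits=8, reference_delays=3)

for region in (sram, rom):
    print(region.name, word_access_time_ns(region, REFERENCE_DELAY_NS))
```

Changing the external delay element rescales every region at once, while the per-region multipliers keep their relative timing.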
AMULET2e systems
AMULET2e has been configured to make building small systems as straightforward as possible. As an example, Figure 14.9 shows a test card incorporating AMULET2e. The only components, apart from AMULET2e itself, are four SRAM chips, one ROM chip, a UART and an RS232 line interface. The UART uses a crystal oscillator to control its bit rate, but all the system timing functions are controlled by AMULET2e.
The ROM contains the standard ARM 'Angel' code and the host computer at the other end of the RS232 serial line runs the ARM development tools. This system demonstrates that using an asynchronous processor need be no more difficult than using a clocked processor provided that the memory interface has been carefully thought out.
Figure 14.9 AMULET2e test card organization.
14.5 AMULET3
AMULET3 is being developed to establish the commercial viability of asynchronous design. Like its predecessors, AMULET3 is a full-functionality ARM-compatible microprocessor with support for interrupts and memory faults. AMULET1 and AMULET2 implemented the ARM6 architecture (ARM architecture version 3G). AMULET3 supports ARM architecture version 4T, including the 16-bit Thumb instruction set.
Performance objective
The objective of the AMULET3 project was to produce an asynchronous implementation of ARM architecture v4T which is competitive in terms of power-efficiency and performance with the ARM9TDMI. This implies a performance target of over 100 MIPS (measured using Dhrystone 2.1) on a 0.35 µm process, compared to the 40 MIPS delivered by AMULET2e on a 0.5 µm process.
Increasing the performance by more than a factor of two requires a radical change to the core organization. As with a clocked processor, the basic approach is based on a combination of increasing the cycle rate of the processor pipeline and decreasing the average number of cycles per instruction. Here, however, the 'cycles' are not defined by an external clock and are not of fixed duration.
AMULET3 core organization
The organization of AMULET3 is illustrated in Figure 14.10 on page 388. The six principal pipeline stages are as follows:
• the instruction prefetch unit, which includes a branch target buffer;
• the instruction decode, register read and forwarding stage;
• the execute stage, which includes the shifter, multiplier and ALU;
• the data memory interface;
• the reorder buffer;
• the register result write-back stage.
The core employs a Harvard architecture (separate instruction and data memory ports) which provides the higher memory bandwidth required to support higher performance. Only those instructions that require data memory access (loads and stores) pass through the data memory interface, so the pipeline depth is greater for these instructions than, for example, for simple data processing instructions.
Further details on each of the pipeline stages are given below.
Instruction prefetch unit
The prefetch unit operates autonomously, fetching 32-bit (one ARM or two Thumb) instruction packets from memory whenever it has room to store them.
It incorporates a branch prediction unit that is similar to the jump trace buffer used in AMULET2, but has extensions to support branch prediction in Thumb code. A two-
