
coefficients and order the multiplication so that the coefficient determines the number of cycles, the multiply will require eight internal cycles. Each load also requires an internal cycle.
If the data and coefficients are always in on-chip memory, the loop takes 19 clock cycles if executed from on-chip RAM, with 14 additional wait state cycles if executed from two wait state off-chip RAM. The use of on-chip RAM therefore speeds the loop up by around 75%. Note that this speed-up is a lot less than the factor of three that might be expected from simply comparing memory access speeds.
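As a rough check on that figure (using only the cycle counts quoted above): the loop costs 19 cycles from on-chip RAM and 19 + 14 = 33 cycles from the two wait state off-chip RAM, so the ratio is 33/19, or about 1.74, which is a speed-up of roughly 74%, consistent with the 'around 75%' quoted.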
Exercise 13.1.1
Estimate the power saving that results from using the on-chip RAM. (Assume that on-chip accesses cost 2 nJ and off-chip accesses 10 nJ.)
Exercise 13.1.2
Repeat the estimates assuming the external memory is restricted to 16 bits wide.
Exercise 13.1.3
Repeat the estimates for both 32- and 16-bit external memory assuming that the processor supports Thumb mode (and the program is rewritten to use Thumb instructions) and high-speed multiplication.
Example 13.2
Survey the designs presented in this chapter and summarize the power-saving techniques employed on the system chips described here.
Firstly, all the system chips display a high level of integration which saves power whenever two modules on the same chip exchange data.

All the chips are based around an ARM core which is, itself, very power-efficient.

Though none of the chips has enough on-chip memory to remove the need for off-chip memory completely, several have a few kilobytes of on-chip RAM which can be loaded with critical routines to save power (and improve performance).

Several of the chips incorporate clock control circuitry which reduces the clock frequency (or stops the clock altogether) when the chip is not fully loaded, and switches the clock to individual modules off when they are inactive.
The chips are mainly based upon static or pseudo-static CMOS technology where, with a little care, digital circuits only consume power when they switch. Some of the chips also incorporate analogue functions which require continuous bias currents; these circuits can usually be powered down when they are inactive.
Exercise 13.2.1
In embedded systems there is usually no 'reset' button to restart a system that has crashed. What techniques are used to recover from software crashes?
Exercise 13.2.2
Several of the chips described in this chapter incorporate counter/timer modules. What are these used for?
Exercise 13.2.3
The ARM7500 has an on-chip cache, AMULET2e (see Section 14.4 on page 384) has on-chip memory which can be configured either as a cache or as directly addressed memory, and Ruby II and VIP simply have directly addressed on-chip memory. Explain the difference between on-chip RAM and an on-chip cache and explain why the designers of these chips made the choices that they did.

The AMULET Asynchronous ARM Processors
Summary of chapter contents
The AMULET processor cores are fully asynchronous implementations of the ARM architecture, which means that they are self-timed and operate without any externally supplied clock.
Self-timed digital systems have potential advantages in the areas of power-efficiency and electromagnetic compatibility (EMC), but they require a radical redesign of the system architecture if the asynchronous timing framework is to be fully exploited, and designers need to learn a new mind-set if they are to work in this style. Support for self-timed design from the tools industry is also lacking.
The AMULET processor cores are research prototypes, developed at the University of Manchester in England, and their commercial future is not yet clear. There is insufficient space in this book to present complete descriptions of them, but they are outlined here since they represent another, very different way to implement the ARM instruction set. See the 'Bibliography' on page 410 for references where further details on AMULET1 and AMULET2e may be found. Fuller details on AMULET3 will be published in due course.
The latest self-timed core, AMULET3, incorporates most of the features found in the clocked ARM cores, including support for the Thumb instruction set, debug hardware, an on-chip macrocell bus which supports system-on-chip modular design, and so on. It brings the asynchronous technology up to the point where it is ready to enter commercial use, and the next few years will tell if this technology has a role to play in the future of the ARM architecture.

In addition, an asynchronous circuit has the potential to deliver typical rather than worst-case performance since its timing adjusts to actual conditions whereas a clocked circuit must be toleranced for worst-case conditions.
The AMULET series of asynchronous microprocessors was developed at the University of Manchester, in England, as one aspect of the growing global research into asynchronous design. There are other examples of asynchronous microprocessors developed elsewhere which support other instruction set architectures, but only the AMULET processors implement the ARM instruction set.
Self-timed signalling
Asynchronous design is a complex discipline with many different facets and many different approaches. It is outside the scope of this book to offer any general introduction to asynchronous design, but a few basic concepts should enable the reader to come to grips with the most important features of the AMULET cores. The foremost of these concepts is the idea of asynchronous communication. How is the flow of data controlled in the absence of any reference clock?
The AMULET designs both use forms of the Request-Acknowledge handshake to control the flow of data. The sequence of actions comprising the communication of data from the Sender to the Receiver is as follows:

1. The Sender places a valid data value onto a bus.
2. The Sender then issues a Request event.
3. The Receiver accepts the data when it is ready to do so.
4. The Receiver issues an Acknowledge event to the Sender.
5. The Sender may then remove the data from the bus and begin the next communication when it is ready to do so.

The data is passed along the bus using a conventional binary encoding, but there are two ways that the Request and Acknowledge events may be signalled:
Transition signalling
• AMULET1 uses transition encoding where a change in level (either high to low or low to high) signals an event; this is illustrated in Figure 14.1.
Figure 14.1 Transition-signalling communication protocol.

Level signalling
• AMULET2 and AMULET3 use level encoding where a rising edge signals an event and a return-to-zero phase must occur before the next event can be signalled; this is illustrated in Figure 14.2.
Figure 14.2 Level-signalling communication protocol.
Transition signalling was used on AMULET1 since it is conceptually cleaner; every transition has a role and its timing is therefore determined by the circuit's function. It also uses the minimum number of transitions, and should therefore be power-efficient. However, the CMOS circuits used to implement transition control are relatively slow and inefficient, so AMULET2 and AMULET3 use level signalling, which employs circuits that are faster and more power-efficient despite using twice the number of transitions, but leaves somewhat arbitrary decisions to be taken about the timing of the recovery (return-to-zero) phases in the protocol.
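As an illustration only (the code below is not taken from the book or from the AMULET designs), the level-signalled, four-phase form of the handshake described above can be sketched in C as two state machines sharing a bundled-data channel. The names channel_t, sender_step and receiver_step are invented for this sketch; real asynchronous hardware implements the same sequencing with gates and delay assumptions rather than software polling.

/* A minimal sketch, not from the book or the AMULET designs: the
 * level-signalled (four-phase) handshake modelled as two state
 * machines sharing a bundled-data channel.  The names channel_t,
 * sender_step and receiver_step are invented for this example. */
#include <stdio.h>
#include <stdbool.h>

typedef struct {
    int  data;       /* bundled data bus                          */
    bool req, ack;   /* Request (sender) / Acknowledge (receiver) */
} channel_t;

typedef enum { S_IDLE, S_WAIT_ACK, S_WAIT_ACK_LOW } sender_state;
typedef enum { R_WAIT_REQ, R_WAIT_REQ_LOW } receiver_state;

static void sender_step(channel_t *ch, sender_state *s, int value)
{
    switch (*s) {
    case S_IDLE:                 /* steps 1 and 2: drive data, raise Request */
        ch->data = value;
        ch->req  = true;
        *s = S_WAIT_ACK;
        break;
    case S_WAIT_ACK:             /* wait for Acknowledge, then step 5 */
        if (ch->ack) { ch->req = false; *s = S_WAIT_ACK_LOW; }
        break;
    case S_WAIT_ACK_LOW:         /* return-to-zero phase completes */
        if (!ch->ack) *s = S_IDLE;
        break;
    }
}

static void receiver_step(channel_t *ch, receiver_state *r)
{
    switch (*r) {
    case R_WAIT_REQ:             /* steps 3 and 4: accept data, raise Acknowledge */
        if (ch->req) {
            printf("received %d\n", ch->data);
            ch->ack = true;
            *r = R_WAIT_REQ_LOW;
        }
        break;
    case R_WAIT_REQ_LOW:         /* drop Acknowledge once Request has fallen */
        if (!ch->req) { ch->ack = false; *r = R_WAIT_REQ; }
        break;
    }
}

int main(void)
{
    channel_t ch = { 0, false, false };
    sender_state   s = S_IDLE;
    receiver_state r = R_WAIT_REQ;

    /* Stepping the two sides alternately walks each transfer through
     * the five protocol steps listed in the text. */
    for (int i = 0; i < 9; i++) {
        sender_step(&ch, &s, 42);
        receiver_step(&ch, &r);
    }
    return 0;
}

A transition-signalled (two-phase) version would dispense with the two return-to-zero states and instead treat every change of level on req and ack as an event, which is why it uses half as many transitions per transfer.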
Self-timed pipelines
An asynchronous pipelined processing unit can be constructed using self-timing techniques to allow for the processing delay in each stage and one of the above protocols to send the result to the next stage. When the circuit is correctly designed, variable processing and external delays can be accommodated; all that matters is the local sequencing of events (though long delays will, of course, lead to low performance).

Unlike a clocked pipeline, where the whole pipeline must always be clocked at a rate determined by the slowest stage under worst-case environmental (voltage and temperature) conditions and assuming worst-case data, an asynchronous pipeline will operate at a variable rate determined by current conditions. It is possible to allow rare worst-case conditions to cause a processing unit to take a little longer. There will be some performance loss when these conditions do arise, but so long as they are rare enough the impact on overall performance will be small.
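This performance argument can be made concrete with a small, purely illustrative model (again, not from the book): a clocked pipeline must advance every stage at the worst-case stage delay, whereas a self-timed pipeline lets each stage pass its result on as soon as it and its operand are ready. The stage count, the delay figures and the 5% worst-case probability below are all invented for the sketch, and the model ignores the back-pressure a full stage would exert on its predecessor.

/* A rough, invented model comparing clocked (worst-case rate) and
 * self-timed (actual-delay) pipeline completion times.  All numbers
 * are illustrative only; back-pressure between stages is ignored. */
#include <stdio.h>
#include <stdlib.h>

#define STAGES 4
#define ITEMS  1000
#define WORST_CASE_DELAY 10.0   /* ns: slowest a stage can ever be */
#define TYPICAL_DELAY     4.0   /* ns: what a stage usually takes  */

/* Most operations complete quickly; a few (say 5%) hit the worst case,
 * e.g. a maximum-length carry or a slow external access. */
static double stage_delay(void)
{
    return (rand() % 100 < 5) ? WORST_CASE_DELAY : TYPICAL_DELAY;
}

int main(void)
{
    double finish[STAGES] = { 0.0 }; /* when each stage last produced a result */
    srand(1);

    for (int i = 0; i < ITEMS; i++) {
        double ready = 0.0;          /* when this item leaves the previous stage */
        for (int s = 0; s < STAGES; s++) {
            /* A stage starts as soon as its operand has arrived and it has
             * finished its previous operation; only local sequencing matters. */
            double start = (ready > finish[s]) ? ready : finish[s];
            finish[s] = start + stage_delay();
            ready = finish[s];
        }
    }

    double self_timed = finish[STAGES - 1];
    double clocked    = (ITEMS + STAGES - 1) * WORST_CASE_DELAY;

    printf("clocked at worst-case rate: %.0f ns\n", clocked);
    printf("self-timed, actual delays : %.0f ns\n", self_timed);
    return 0;
}

With these made-up numbers the self-timed total comes out considerably below the worst-case-clocked figure, because worst-case delays are rare; the two only coincide if every stage hits its worst case on every operation.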
14.2 AMULET1
The AMULET1 processor core has the high-level organization illustrated in Figure 14.3 on page 378. The design is based upon a set of interacting asynchronous pipelines, all operating in their own time at their own speed. These pipelines might appear to introduce unacceptably long latencies into the processor but, unlike a synchronous pipeline, an asynchronous pipeline can have a very low latency.

Figure 14.3 AMULET1 internal organization.
The operation of the processor begins with the address interface issuing instruction fetch requests to the memory. The address interface has an autonomous address incrementer which enables it to prefetch instructions as far ahead as the capacities of the various pipeline buffers allow.
Address non-determinism
Instructions that need to generate a new memory address, such as data transfer instructions and unpredicted branches, calculate the new address in the execution pipeline and then send it to the address interface. Since it arrives with arbitrary timing relative to the internal incrementing loop within the address interface, the point of insertion of the new address into the address stream is non-deterministic, so the processor's depth of prefetch beyond a branch instruction is fundamentally non-deterministic.