
The SA-1100

Figure 13.16 SA-1100 organization.

Memory controller

The memory controller supports up to four banks of 32-bit off-chip DRAM, which may be conventional or of the 'extended data out' (EDO) variety. ROM, flash memory and SRAM are also supported.

Further memory expansion is supported through the PCMCIA interface (which requires some external 'glue' logic), where two card slots are supported.


System control

The on-chip system control functions include:

• a reset controller;

• a power management controller that handles low-battery warnings and switches the system between its various operating modes;

• an operating system timer block that supports general timing and watchdog functions;

• an interrupt controller;

• a real-time clock that runs from a 32 kHz crystal source;

• 28 general-purpose I/O pins.

Peripherals

The peripheral subsystem includes an LCD controller and specialized serial ports for USB, SDLC, IrDA, codec and standard UART functions. A 6-channel DMA controller releases the CPU from straightforward data transfer responsibilities.

Bus structure

As can be seen from Figure 13.16, the SA-1100 is built around two buses connected through a bridge:

• The system bus connects all the bus masters and the off-chip memory.

• The peripheral bus connects all the slave peripheral devices.

This dual bus structure is similar to the AMBA ASB-APB (or AHB-APB) split. It minimizes the size of the bus that has a high duty cycle and also reduces the complexity and cost of the bus interface that must be built into all of the peripherals.

Applications

A typical SA-1100 application will require a certain amount of off-chip memory, probably including some DRAM and some ROM and/or flash memory. All that is then required is the necessary interface electronics for the various peripheral interfaces, display, and so on. The resulting system is very simple at the PCB level, yet very powerful and sophisticated in terms of its processing capability and system architecture.

SA-1100 silicon

The characteristics of the SA-1100 chip are summarized in Table 13.4 and a plot of the die is shown in Figure 13.17 on page 371. The chip can operate with a power supply voltage as low as 1.5 V for optimum power-efficiency. Higher performance can be achieved at the cost of somewhat lower power-efficiency by operating the device at a slightly higher supply voltage.

Table 13.4 SA-1100 characteristics.

Process: 0.35 µm
Transistors: 2,500,000
MIPS: 220/250
Metal layers: 3
Die area: 75 mm²
Power: 330/550 mW
Vdd: 1.5/2 V
Clock: 190/220 MHz
MIPS/W: 665/450
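The MIPS/W column in Table 13.4 follows directly from the MIPS and Power rows. As an illustrative cross-check, here is a minimal C fragment with the table values hard-coded (the small differences from the published 665/450 figures are simply rounding in the quoted numbers):

/* Cross-check of the MIPS/W column in Table 13.4 for the two SA-1100
   operating points (1.5 V and 2 V supplies). */
#include <stdio.h>

int main(void)
{
    double mips[2]  = { 220.0, 250.0 };    /* MIPS row               */
    double power[2] = { 0.330, 0.550 };    /* Power row, in watts    */

    for (int i = 0; i < 2; i++)
        printf("%.0f MIPS / %.2f W = %.0f MIPS/W\n",
               mips[i], power[i], mips[i] / power[i]);
    return 0;
}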

 

 

 

 

 

 


Figure 13.17 SA-1100 die plot.

13.8 Examples and exercises

Example 13.1

Estimate the performance improvement which results from running a critical DSP routine in zero wait state on-chip RAM instead of two wait state off-chip RAM.

Typical DSP routines are dominated by multiply-accumulate computations. A code sequence might be:

 

 

 

                                      ; initialize
LOOP    LDR     r0, [r3], #4          ; get next data value
        LDR     r1, [r4], #4          ; and next coefficient
        MLA     r5, r0, r1, r5        ; accumulate next term
        SUBS    r2, r2, #1            ; decrement loop counter
        BNE     LOOP                  ; loop if not finished
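In C terms this is just a multiply-accumulate (dot-product) kernel. The rendering below is purely illustrative, with data, coeff and n standing in for r3, r4 and r2:

/* Illustrative C equivalent of the loop above: acc plays the role of r5. */
#include <stdio.h>

int mac(const int *data, const short *coeff, int n)
{
    int acc = 0;                        /* initialize                 */
    while (n-- != 0)
        acc += *data++ * *coeff++;      /* LDR, LDR, MLA, SUBS, BNE   */
    return acc;
}

int main(void)
{
    int   d[4] = { 1, 2, 3, 4 };
    short c[4] = { 5, 6, 7, 8 };
    printf("%d\n", mac(d, c, 4));       /* 1*5 + 2*6 + 3*7 + 4*8 = 70 */
    return 0;
}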

This loop requires seven instruction fetches (including two in the branch shadow) and two data fetches. On a standard ARM core the multiply takes a data-dependent number of computation cycles, evaluating two bits per cycle. If we assume 16-bit coefficients and order the multiplication so that the coefficient determines the number of cycles, the multiply will require eight internal cycles. Each load also requires an internal cycle.

 

If the data and coefficients are always in on-chip memory, the loop takes 19 clock cycles if executed from on-chip RAM, with 14 additional wait state cycles if executed from two wait state off-chip RAM.

The use of on-chip RAM therefore speeds the loop up by around 75%.

Note that this speed-up is a lot less than the factor of three that might be expected from simply comparing memory access speeds.
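The cycle arithmetic can be checked with a short illustrative C calculation using the assumptions stated above: three-cycle loads, an eight-cycle multiply for a 16-bit coefficient at two bits per cycle, a three-cycle taken branch, and two wait states on each of the seven off-chip instruction fetches (the data accesses stay on-chip):

/* Recomputes the Example 13.1 cycle estimate under the stated assumptions. */
#include <stdio.h>

int main(void)
{
    int ldr  = 3;            /* fetch + data access + internal cycle          */
    int mla  = 1 + 8;        /* fetch + 8 internal cycles (16-bit coefficient,
                                evaluated at two bits per cycle)              */
    int subs = 1;            /* single-cycle ALU operation                    */
    int bne  = 3;            /* taken branch: fetch + two refill cycles       */

    int on_chip = 2 * ldr + mla + subs + bne;          /* 19 cycles           */

    int ifetches = 7;                                  /* incl. branch shadow */
    int off_chip = on_chip + ifetches * 2;             /* 19 + 14 = 33 cycles */

    printf("on-chip: %d cycles, off-chip: %d cycles\n", on_chip, off_chip);
    printf("speed-up: %.0f%%\n", 100.0 * (off_chip - on_chip) / on_chip);
    return 0;
}

This reproduces the 19- and 33-cycle figures and a speed-up of about 74%, consistent with the 'around 75%' quoted above.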

Exercise 13.1.1

Estimate the power saving that results from using the on-chip RAM. (Assume that on-chip accesses cost 2 nJ and off-chip accesses 10 nJ.)

Exercise 13.1.2

Repeat the estimates assuming the external memory is restricted to 16 bits wide.

Exercise 13.1.3

Repeat the estimates for both 32- and 16-bit external memory assuming that the processor supports Thumb mode (and the program is rewritten to use Thumb instructions) and high-speed multiplication.

Example 13.2

Survey the designs presented in this chapter and summarize the power-saving techniques employed on the system chips described here.

 

Firstly, all the system chips display a high level of integration which saves power whenever two modules on the same chip exchange data.

All the chips are based around an ARM core which is, itself, very power-efficient.

Though none of the chips has enough on-chip memory to remove the need for off-chip memory completely, several have a few kilobytes of on-chip RAM which can be loaded with critical routines to save power (and improve performance).

Several of the chips incorporate clock control circuitry which reduces the clock frequency (or stops the clock altogether) when the chip is not fully loaded, and switches the clock to individual modules off when they are inactive.

 

The chips are mainly based upon static or pseudo-static CMOS technology where, with a little care, digital circuits only consume power when they switch. Some of the chips also incorporate analogue functions which require continuous bias currents, but these circuits can usually be powered down when they are inactive.

Exercise 13.2.1

In embedded systems there is usually no 'reset' button to restart a system that has crashed. What techniques are used to recover from software crashes?


Exercise 13.2.2

Several of the chips described in this chapter incorporate counter/timer modules. What are these used for?

Exercise 13.2.3

The ARM7500 has an on-chip cache, AMULET2e (see Section 14.4 on page 384) has on-chip memory which can be configured either as a cache or as directly addressed memory, and Ruby II and VIP simply have directly addressed on-chip memory. Explain the difference between on-chip RAM and an on-chip cache and explain why the designers of these chips made the choices that they did.

The AMULET Asynchronous ARM Processors

Summary of chapter contents

The AMULET processor cores are fully asynchronous implementations of the ARM architecture, which means that they are self-timed and operate without any externally supplied clock.

Self-timed digital systems have potential advantages in the areas of power-efficiency and electromagnetic compatibility (EMC), but they require a radical redesign of the system architecture if the asynchronous timing framework is to be fully exploited, and designers need to learn a new mind set if they are to work in this style. Support for self-timed design from the tools industry is also lacking.

The AMULET processor cores are research prototypes, developed at the University of Manchester in England, and their commercial future is not yet clear. There is insufficient space in this book to present complete descriptions of them, but they are outlined here since they represent another, very different way to implement the ARM instruction set. See the 'Bibliography' on page 410 for references where further details on AMULET1 and AMULET2e may be found. Fuller details on AMULET3 will be published in due course.

The latest self-timed core, AMULET3, incorporates most of the features found in the clocked ARM cores, including support for the Thumb instruction set, debug hardware, an on-chip macrocell bus which supports system-on-chip modular design, and so on. It brings the asynchronous technology up to the point where it is ready to enter commercial use, and the next few years will tell if this technology has a role to play in the future of the ARM architecture.


14.1 Self-timed design

Today, and for the past quarter of a century, virtually all digital design has been based around the use of a global 'clock' signal which controls the operation of all of the components of the system. Complex systems may use multiple clocks, but then each clocked domain is designed as a synchronous subsystem, and the interfaces between the different domains have to be designed very carefully to ensure reliable operation.

 

Clocks have not always been so dominant. Many of the earliest computers employed control circuits which used local information to control local activity. However, such 'asynchronous' design styles fell out of favour when highly productive design methodologies based around the use of a central clock were developed to match the demands of an industry with access to integrated circuit technologies which offered rapidly increasing transistor resources.

Clock problems

Recently, however, clocked design styles have begun to run into practical problems:

• Ever-increasing clock frequencies mean that maintaining global synchrony is getting harder; clock skew (the difference in the clock timing at different points on the chip) compromises performance and can ultimately lead to circuit malfunction.

• High clock rates also lead to excessive power consumption. The clock distribution network can be responsible for a significant fraction of the total chip power consumption. Although clock-gating techniques can give a degree of control over the activity of the clock net, this is usually coarse-grained and can adversely affect clock skew.

• Global synchrony maximizes supply current transients, causing electromagnetic interference. EMC (electromagnetic compatibility) legislation is becoming more stringent, and increasing clock rates make compliance ever more challenging.

Motivation for self-timed design

For these reasons, asynchronous design techniques have recently attracted renewed interest as they address the above problems as follows:

• There is no problem with clock skew because there is no clock.

• An asynchronous design only causes transitions in the circuit in response to a request to carry out useful work, avoiding the continuous drain caused by the clock signal and the overhead of complex power management systems. It can switch instantaneously between zero power dissipation and maximum performance. Since many embedded applications have rapidly varying workloads, an asynchronous processor appears to offer the potential of significant power savings.

• Asynchronous circuits emit less electromagnetic radiation owing to their less coherent internal activity.


In addition, an asynchronous circuit has the potential to deliver typical rather than worst-case performance since its timing adjusts to actual conditions whereas a clocked circuit must be toleranced for worst-case conditions.

 

The AMULET series of asynchronous microprocessors was developed at the University of Manchester, in England, as one aspect of the growing global research into asynchronous design. There are other examples of asynchronous microprocessors, developed elsewhere, which support other instruction set architectures, but only the AMULET processors implement the ARM instruction set.

Self-timed signalling

Asynchronous design is a complex discipline with many different facets and many different approaches. It is outside the scope of this book to offer any general introduction to asynchronous design, but a few basic concepts should enable the reader to come to grips with the most important features of the AMULET cores. The foremost of these concepts is the idea of asynchronous communication. How is the flow of data controlled in the absence of any reference clock?

 

The AMULET designs all use forms of the Request-Acknowledge handshake to control the flow of data. The sequence of actions comprising the communication of data from the Sender to the Receiver is as follows:

 

1. The Sender places a valid data value onto a bus.

2. The Sender then issues a Request event.

3. The Receiver accepts the data when it is ready to do so.

4. The Receiver issues an Acknowledge event to the Sender.

5. The Sender may then remove the data from the bus and begin the next communication when it is ready to do so.

 

The data is passed along the bus using a conventional binary encoding, but there are two ways that the Request and Acknowledge events may be signalled:

Transition signalling

• AMULET1 uses transition encoding where a change in level (either high to low or low to high) signals an event; this is illustrated in Figure 14.1.

Figure 14.1 Transition-signalling communication protocol.


Level signalling

• AMULET2 and AMULET3 use level encoding where a rising edge signals an event and a return-to-zero phase must occur before the next event can be signalled; this is illustrated in Figure 14.2.

 

Figure 14.2 Level-signalling communication protocol.

 

Transition signalling was used on AMULET1 since it is conceptually cleaner; every transition has a role and its timing is therefore determined by the circuit's function. It also uses the minimum number of transitions, and should therefore be power-efficient. However, the CMOS circuits used to implement transition control are relatively slow and inefficient, so AMULET2 and AMULET3 use level signalling, which employs circuits which are faster and more power-efficient despite using twice the number of transitions, but leaves somewhat arbitrary decisions to be taken about the timing of the recovery (return-to-zero) phases in the protocol.
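To make the protocol concrete, the sketch below models one level-signalled (four-phase) transfer in C; the variable names are illustrative only, and real asynchronous hardware runs the Sender and Receiver concurrently rather than interleaving their phases in program order as this single-threaded model does:

/* Minimal model of a level-signalled Request-Acknowledge transfer.
   Steps 1-5 correspond to the sequence listed earlier; the final two
   assignments are the return-to-zero recovery phase. */
#include <stdio.h>
#include <stdbool.h>

static int  bus;              /* data wires                    */
static bool req, ack;         /* request and acknowledge wires */

static void transfer(int value)
{
    bus = value;              /* 1. Sender places valid data on the bus      */
    req = true;               /* 2. Sender issues a Request event            */

    int received = bus;       /* 3. Receiver accepts the data                */
    ack = true;               /* 4. Receiver issues an Acknowledge event     */

    req = false;              /* 5. Sender may remove the data and, after    */
    ack = false;              /*    the return-to-zero phase, begin the next */
                              /*    communication                            */
    printf("received %d\n", received);
}

int main(void)
{
    for (int i = 1; i <= 3; i++)
        transfer(i * 10);
    return 0;
}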

Self-timed pipelines

An asynchronous pipelined processing unit can be constructed using self-timing techniques to allow for the processing delay in each stage and one of the above protocols to send the result to the next stage. When the circuit is correctly designed, variable processing and external delays can be accommodated; all that matters is the local sequencing of events (though long delays will, of course, lead to low performance).

Unlike a clocked pipeline, where the whole pipeline must always be clocked at a rate determined by the slowest stage under worst-case environmental (voltage and temperature) conditions and assuming worst-case data, an asynchronous pipeline will operate at a variable rate determined by current conditions. It is possible to allow rare worst-case conditions to cause a processing unit to take a little longer. There will be some performance loss when these conditions do arise, but so long as they are rare enough the impact on overall performance will be small.
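The typical-versus-worst-case argument can be illustrated with a simple C model (an idealized sketch, not a description of any AMULET circuit): items pass through a few stages whose delays vary, the clocked pipeline must advance every stage at the worst-case delay, while the self-timed pipeline lets each stage take only its actual delay subject to the handshakes with its neighbours:

/* Idealized throughput comparison: clocked pipeline at the worst-case stage
   delay versus a self-timed pipeline at actual stage delays. The handshake
   is modelled simply: a stage starts an item once the previous stage has
   delivered it and the stage itself has finished its previous item. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define STAGES 4
#define ITEMS  1000

static double stage_delay(void)              /* usually fast, occasionally slow */
{
    return (rand() % 100 < 95) ? 1.0 : 3.0;  /* ns */
}

int main(void)
{
    srand((unsigned)time(NULL));

    double worst   = 3.0;                    /* clock period must cover this    */
    double clocked = (ITEMS + STAGES - 1) * worst;

    double finish[STAGES] = { 0 };   /* when each stage last released an item   */
    for (int i = 0; i < ITEMS; i++) {
        double prev = 0.0;           /* time item i leaves the previous stage   */
        for (int s = 0; s < STAGES; s++) {
            double start = (prev > finish[s]) ? prev : finish[s];
            prev = finish[s] = start + stage_delay();
        }
    }

    printf("clocked:    %.0f ns\n", clocked);
    printf("self-timed: %.0f ns\n", finish[STAGES - 1]);
    return 0;
}

On a typical run the self-timed total comes out substantially lower than the clocked total, because the occasional slow operations delay only the items that actually encounter them.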

14.2 AMULET1

The AMULET1 processor core has the high-level organization illustrated in Figure 14.3 on page 378. The design is based upon a set of interacting asynchronous pipelines, all operating in their own time at their own speed. These pipelines might appear to introduce unacceptably long latencies into the processor but, unlike a synchronous pipeline, an asynchronous pipeline can have a very low latency.

Figure 14.3 AMULET1 internal organization.

The operation of the processor begins with the address interface issuing instruction fetch requests to the memory. The address interface has an autonomous address incrementer which enables it to prefetch instructions as far ahead as the capacities of the various pipeline buffers allow.

Address non-determinism

Instructions that need to generate a new memory address, such as data transfer instructions and unpredicted branches, calculate the new address in the execution pipeline and then send it to the address interface. Since it arrives with arbitrary timing relative to the internal incrementing loop within the address interface, the point of insertion of the new address into the address stream is non-deterministic, so the processor's depth of prefetch beyond a branch instruction is fundamentally non-deterministic.
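The effect can be visualized with a small illustrative C model (not AMULET1's actual logic): the address interface keeps issuing sequential fetch addresses until the branch target arrives from the execution pipeline, and because that arrival delay is modelled as random here, the prefetch depth beyond the branch varies from run to run:

/* Illustrative model of non-deterministic prefetch depth: the autonomous
   incrementer runs on until the asynchronously computed branch target is
   inserted into the address stream. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void)
{
    srand((unsigned)time(NULL));

    unsigned pc     = 0x8000;    /* address of the branch instruction        */
    unsigned target = 0x9000;    /* target computed in the execute pipeline  */
    int delay = 1 + rand() % 5;  /* fetches issued before the target arrives */

    printf("fetch 0x%X (branch)\n", pc);
    pc += 4;

    int depth = 0;
    while (depth < delay) {      /* incrementer prefetches on ...            */
        printf("fetch 0x%X (beyond the branch)\n", pc);
        pc += 4;
        depth++;
    }

    pc = target;                 /* ... until the new address is inserted    */
    printf("target 0x%X inserted; prefetch depth was %d\n", pc, depth);
    return 0;
}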