Добавил:
Опубликованный материал нарушает ваши авторские права? Сообщите нам.
Вуз: Предмет: Файл:

Furber S.ARM system-on-chip architecture.2000

.pdf
Скачиваний:
89
Добавлен:
23.08.2013
Размер:
18.35 Mб
Скачать

AMULETS

389

Execute

Data memory interface

Reorder buffer

Register write

allows some elasticity in its operation, and there is a separate Thumb decode stage that effectively collapses when ARM code is running.

The register read and forwarding stage collects operands from the register file and searches the reorder buffer for newer values, supplying the correct up-to-date value to the execution unit. If the correct value is not yet available the forwarding process will stall until it becomes available.

Following the insights of the StrongARM designers, the AMULETS register file has three read ports. This allows nearly all instructions to be issued in a single cycle.

The execution stage includes the adder, shifter and multiplier, and the PSRs also fit logically into this stage. The multiplier computes 8 bits of product in a cycle, but cycles much faster than the typical processor cycle time so it computes a 32 x 32 product in around 20 ns. The adder uses the carry arbitration scheme also used in the ARM9TDMI (see 'Carry arbitration adder' on page 91).

The data memory interface operates as a separate pipeline stage in load and store instructions but is bypassed by other types of instruction. A load that addresses a slow memory location will therefore not hold up other instructions in the pipeline, provided that they do not use the loaded value.

The reorder buffer accepts results from the execution unit and the data memory interface in the order that they become available and stores them until they can be written back to the register file in program order. Once a result is in the reorder buffer it is available for forwarding to any instruction that needs it.

The structure used to achieve this functionality is illustrated in Figure 14.11 on page 390. The buffer has four 'slots', and a slot is allocated to each result produced by an instruction when that instruction is issued. (Slots are also allocated to 'store' instructions so that memory faults can be handled correctly.) As soon as the result is available it is moved into its slot.

Forwarding from the reorder buffer requires a search of the buffer contents. Conditional ARM instructions may or may not return valid results to the buffer; multiple instructions may use the same destination register. Identifying the most up-to-date value for a particular register potentially requires waiting for a result to become available in a slot, seeing that it is invalid, and then (worst case) searching all the other slots in series. This case is rare, though, and is easily accommodated by the asynchronous timing framework.

The write-back stream delivered by the reorder buffer is in program order, so this stage of the pipeline simply delivers values to the register file, checking each one for validity and looking for any memory faults. If a fault is detected an exception is raised and subsequent results are discarded unti\ the exception handling, mechanism has activated.

390

The AMULET Asynchronous ARM Processors

Figure 14.11 The AMULETS reorder buffer organization.

AMULETS

The performance characteristics of AMULET3 are summarized in Table 14.4. It can

performance

be seen that the objective of achieving comparability with the ARM9TDMI on the

 

same process has been met.

 

 

 

 

AMULET3 has been used as the processing core in the DRACO telecommunica-

 

tions controller; this is the subject of the next section.

 

Table 14.4 AMULET3 characteristics.

 

 

 

 

 

 

 

 

 

Process

0.35 urn

Transistors

113,000

MIPS

120

 

Metal layers

3

Core area

3mm2

Power

154mW

 

Vdd

3.3V Clock

none MIPS/W

780

 

 

 

 

 

 

 

 

14.6The DRACO telecommunications controller

The DRACO (DECT Radio Communications Controller) chip is the first commercial design based on AMULET technology; it uses the AMULET3H self-timed processing subsystem described later in this section as its compute and control engine.

DRACO was developed in collaboration between Hagenuk GmbH (who designed the clocked telecommunications peripherals) and the University of Manchester (who were responsible for the AMULETS H subsystem) with funding from the European Union.

The DRACO telecommunications controller

391

Rationale for self-timing

DRACO functions

The decision to employ a self-timed processing subsystem in the DRACO chip was influenced primarily by the electromagnetic compatibility (EMC) advantages of self-timed logic. The radio interference generated by high-speed clocked systems can compromise the performance of the DECT radio communication; a harmonic of the clock is likely to fall within one of the DECT channel frequency bands, possibly rendering that channel unusable.

The logical conclusion might be to make all of the chip operate in self-timed logic. However, many of the telecommunications interfaces are intrinsically highly synchronous and are much easier to design using synchronous synthesis tools.

The synchronous peripheral subsystem should not compromise the EMC advantages offered by the self-timed processing subsystem since the characteristic frequencies of the peripherals are all much lower than the processing rate. All the high-speed processing and the heavily loaded external memory bus operate asynchro-nously, giving most of the advantages of a fully self-timed system without incurring the design cost of developing self-timed telecommunications peripherals.

The DRACO chip includes the following functions in the synchronous peripheral subsystem:

an ISDN controller with a 16 Kbit/s HDLC controller and transformerless ana logue ISDN interfaces;

a DECT radio interface baseband controller and analogue interface to the DECT radio subsystem;

a DECT encryption engine;

a four-channel full-duplex ADPCM/PCM conversion signal processor;

8 Kbytes of shared RAM used for buffer space by the DECT controller;

a telecommunications codec with an analogue front-end for speech input and output which could alternatively be used for an analogue telecommunications port;

two high-speed UARTs with an IrDA interface that can be used by either UART;

an interrupt controller;

counter-timers and a watchdog timer;

a 2 Mbit/s IOM2 highway controller with programmable switching functionality;

an I2C interface;

an analogue-to-digital converter (ADC) interface;

65 flexible I/O ports including an 8-bit parallel port with handshake capability;

2 general-purpose pulse-width modulation controllers;

an on-chip clock oscillator produces 38.864 MHz, and on-chip phase-locked loops produce the 12.288 MHz master clock required by the ISDN interface and the 13.824 MHz master clock required by the DECT controller;

392

The AMULET Asynchronous ARM Processors

an ISDN-DECT synchronizer phase-locked loop avoids bit loss in data transfers between the ISDN and DECT clock domains;

the clock system has power-down features.

In addition, the self-timed AMULET3H processing subsystem incorporates:

a 100 MIPS AMULET3 32-bit processor that implements ARM architecture v4T (including the Thumb 16-bit compressed instruction set) with debug hardware;

8 Kbytes of dual-port high-speed RAM (local to the processor);

a self-timed on-chip bus with a bridge to the synchronous on-chip bus that con nects the synchronous peripherals;

a 32-channel DMA controller;

16 Kbytes of ROM holding standard telecommunications application software;

an asynchronous event driven load module that holds the processor in zero-power wait mode while an analogue-to-digital conversion completes, thereby synchro nizing the software to external data rates;

a programmable external memory interface that supports the direct connection of SRAM, DRAM and flash memory;

an on-chip reference delay line calibrated by software to control off-chip memory access timings.

AMULET3H

AMULET3H is a subsystem based around an AMULET3 core that forms an asyn-

 

chronous 'island' at the heart of the DRACO telecommunications controller in

 

which it is interfaced to a range of synchronous peripheral controllers. To reap the

 

benefits of asynchronous operation the core must have access to some memory that

 

operates asynchronously, and there are significant electromagnetic compatibility

 

benefits in having off-chip memory also operate asynchronously.

MARBLE on-

The organization of the asynchronous subsystem is illustrated in Figure 14.12 on

Chip bus

page 393. The AMULET3 core is connected directly to a dual-ported RAM (dis-

 

cussed further below) and then to the MARBLE on-chip bus. MARBLE is similar

 

in concept to ARM's AMBA bus, the major difference being that it does not use a

 

clock signal. Its transfer mechanism is based around a split transaction primitive.

 

 

System components other than the local RAM are accessed via the MARBLE

 

bus. These include on-chip ROM, a DMA controller, a bridge to the synchronous bus

 

where the application-specific peripheral controllers reside, and an interface to exter-

 

nal memory.

External memory

The external memory interface presents a conventional set of signals for

 

 

off-chip

interface

devices. It is similar to the AMULET2e interface (see Section 14.4 on page 384),

 

being highly configurable and using a reference delay to time external accesses.

The DRACO telecommunications controller

393

Synchronous peripheral interface

 

Figure 14.12 The AMULET3H asynchronous subsystem.

 

Instead of using an off-chip reference delay, however, AMULETS H has an on-chip

 

delay which can be calibrated by software against a timing reference, perhaps using

 

a timer/counter in the synchronous peripheral subsystem for this purpose.

 

Off-chip DRAM is supported directly, as are conventional ROM, SRAM and

 

memory mapped peripheral devices.

AMULET3H

The AMULETS processor core has separate address and data buses for instruction

memory

and data memory accesses. This would normally require separate instruction and

Organization

data memories; RISC systems frequently employ a 'modified Harvard' architecture

 

where there are separate instruction and data caches with a unified main memory.

 

The AMULET3H controller employs direct-mapped RAM rather than cache

 

memory as this is more cost-effective and has more deterministic behaviour for real-

 

time applications. It also avoids separate instruction and data memories (and the asso-

 

ciated difficulties of keeping them coherent) through the use of a dual-ported memory

 

structure (see Figure 14.13 on page 394). Dual-porting the memory at the individual

 

bit level would be too costly, so instead the memory is divided into eight 1 Kbyte

 

blocks, each of which has two ports which are arbitrated internally. When concurrent

 

data and instruction accesses are to different RAM blocks, each can proceed

394

Test interface controller

The AMULET Asynchronous ARM Processors

Figure 14.13 AMULET3H memory organization.

unimpeded by the other; when they happen to conflict on the same block, one access will suffer a delay while it waits for the other to complete.

Conflicts (and average memory access times) are further reduced by including separate quad-word instruction and data buffers in each RAM block. Each access to a block first checks to see whether the required data is in the buffer. Only if it is not must the RAM be interrogated, with a risk of conflict. Simulations suggest that about 60% of instruction fetches may be satisfied from within these buffers and many short, time-critical loops will run entirely from them.

These buffers, in effect, form simple 128-byte first-level caches in front of the RAM blocks. This is a particularly apt analogy when it is observed that the avoidance of the RAM array results in a faster read cycle, an occurrence which is accepted automatically by the asynchronous pipeline.

AMULET3H is being developed for commercial use and must therefore be testable in production. There are many features in the design to ensure that this is possible, the most visible of which is the test interface controller (see Figure 14.12 on page 393). This extension to the external memory interface logic follows the example set by AMBA (see "Test interface" on page 219) in enabling the external memory interface, which in normal operation is a MARBLE slave, to become a MARBLE master for test purposes. In test mode production test equipment can become a MARBLE master, allowing it to read on-chip ROM and to read and write on-chip RAM, and to access the many other test facilities built into the AMULET3H system, all of which are controlled via test registers connected to the MARBLE bus.

The DRACO telecommunications controller

395

AMULET3H

The simulated performance characteristics of the AMULET3H subsystem are sum-

performance

marized in Table 14.5. (The transistor count, area and power figures are for the

 

asynchronous subsystem only. Overall the DRACO die is 7.8 x 7.3 mm.) The maxi-

 

mum system performance is slightly lower than that of the AMULET3 core

 

(reported in Table 14.4 on page 390) since the on-chip RAM cannot supply the

 

processor's peak instruction bandwidth requirement. The system power-efficiency is

 

lower than that of the processor alone since the memory system power is now

 

included as well as the power consumed by the processor core itself.

 

Table 14.5

AMULET3H characteristics.

 

 

 

 

 

 

 

 

 

Process

0.35 urn

Transistors

825,000

MIPS

100

 

Metal layers

3

Core area

21mm2

Power

215 mW

 

Vdd

3.3V

Clock

none MIPSAV

465

 

 

 

 

 

 

 

 

Figure 14.14 DRACO die plot.

It can be seen that the AMULET3H subsystem is competitive with the equivalent clocked ARM systems in terms of performance and power-efficiency. It has the

396 The AMULET Asynchronous ARM Processors

 

advantages of better electromagnetic interference and system power management

 

properties. Against these advantages must be weighed the present shortcomings in the

 

available tool support for asynchronous design.

DRACO

The DRACO chip can be used in a number of telecommunications applications. A

applications

typical application is a combined ISDN terminal and DECT base-station which

 

would enable a number of users to communicate with each other using cordless

 

DECT handsets and to connect to the telecommunications network via the ISDN

 

line or a cordless ISDN network terminator. The principal target market for the chip

 

is in cordless DECT data applications.

DRACO silicon

A plot of the DRACO silicon layout is shown in Figure 14.14 on page 395. The

 

AMULET3H asynchronous subsystem occupies the bottom half of the core area;

 

the synchronous telecommunications peripherals occupy the top half. First silicon

 

was fabricated early in 2000.

14.7A self-timed future?

Although AMULET3 will be used commercially the AMULET programme has, so far, been primarily of a research nature, and there is no immediate prospect of the AMULET cores replacing the clocked ARM cores in widespread use. However, there is a resurgence of world-wide interest in the potential of asynchronous design styles to save power, to improve electromagnetic compatibility, and to offer a more modular approach to the design of computing hardware.

The power savings which result from removing the global clock, leaving each subsystem to perform its own timing functions whenever it has useful work to perform, are clear in theory but there are few demonstrations that the benefits can be realized in practice with circuits of sufficient complexity to be commercially interesting. The AMULET research is aimed directly at adding to the body of convincing demonstrations of the merits of asynchronous technology.

An obstacle to the widespread adoption of self-timed design styles is the knowledge-base of the existing design community. Most 1C designers have been trained to have a strong aversion to asynchronous circuits because of the difficulties that were experienced by the designers of some early asynchronous computers. These difficulties resulted from an undisciplined approach to self-timed design, and modern developments offer asynchronous design frameworks which overcome most of the problems inherent in what is, admittedly, a more anarchic approach to logic design than that offered within the clocked framework.

Example and exercises

397

 

The next few years will tell whether or not AMULET and similar developments

 

around the world can demonstrate the sort of advantages that will cause designers to

 

throw away most of their past education and learn a new way to perform their duties.

AMULET

AMULET 1 was developed using European Community funding within the Open

Support

Microprocessor systems Initiative - Microprocessor Architecture Project (OMI-

 

MAP). AMULET2 was developed within the Open Microprocessor systems Initia-

 

tive - Deeply Embedded ARM (OMI-DE/ARM) project. The development of

 

AMULETS and DRACO was supported primarily within the European Union

 

funded OMI-DE2 and OMI-ATOM projects. Aspects of the work have benefited

 

from support from the UK government through the Engineering and Physical Sci-

 

ences Research Council (EPSRC) in the form of a PhD studentships and the tools

 

development funded under ROPA grant GR/K61913.

 

The work also received support in various forms from ARM Limited, GEC Plessey

 

Semiconductors (now part of MITEL) and VLSI Technology, Inc. Tools from Com-

 

pass Design Automation (now part of Avant!) were important to the success of both

 

projects, and TimeMill from EPIC Design Technology, Inc. (now part of Synopsys)

 

was vital to the accurate modelling of AMULET2 and AMULETS.

14.8Example and exercises

Example 14.1

Summarize the advantages and disadvantages of self-timed design.

 

The advantages discussed here are:

 

• improved electromagnetic compatibility (EMC);

 

• improved power-efficiency;

 

• improved design modularity;

 

• the removal of the clock-skew problem;

 

• the potential for typical rather than worst-case performance.

 

The disadvantages are:

 

• the unavailability of design tools for self-timed design;

 

• the lack of designer experience;

 

• an aversion to self-timed techniques has been encouraged in design education;

 

• there is a lack of commercial-scale demonstrations of the above-claimed advantages.

 

In general it is still a high-risk option to pursue a self-timed solution to a design

 

problem, and the lack of tools can make it very labour intensive.

398 The AMULET Asynchronous ARM Processors

Exercise 14.1.1

Estimate the effect on the performance of both clocked and self-timed processors of

 

memory conflicts when the processor core has separate instruction and data ports

 

both of which are connected to a single dual-port memory. Assume the memory is

 

constructed from eight arbitrated segments similar to the AMULET3H memory but

 

without the line buffers (see "AMULET3H memory organization" on page 393).

 

You can assume that the processor fetches instructions continuously and requires

 

about one data memory access for every two instructions. The instruction and data

 

accesses from the clocked processor will be synchronized (by the clock!), so every

 

time contention arises one of the two accesses will be stalled for one cycle. The asyn-

 

chronous processor has no such sychronization so the two accesses have random rela-

 

tive timing and the stall will be for whatever proportion of the first contending access

 

time remains when the second contending access is requested.

Exercise 14.1.2

Estimate the effect on performance of the instruction and data line buffers in the

 

AMULET3H memory system. Make the same assumptions as in the previous exercise.