Furber S.ARM system-on-chip architecture.2000


AMULET1


If the execution pipeline is to work efficiently, the register file must be able to issue the operands for an instruction before the result of the previous instruction has returned. However, in some cases an operand may depend on a preceding result (this is a read-after-write hazard), in which case the register file cannot issue the operand until the result has returned unless there is a forwarding mechanism to supply the correct value further down the pipeline.

The register forwarding mechanism used on many RISC processors (including ARM8 and StrongARM) is based upon the characteristics of a synchronous pipeline, since it involves comparing the source operand register number in one pipeline stage with the destination register number in another stage. In an asynchronous pipeline the stages are all moving at different times, and such a comparison can only be made by introducing explicit synchronization between the stages concerned, thereby losing most of the benefits of asynchronous operation.

On AMULET1, register coherency is achieved through a novel form of register locking, based on a register lock FIFO (first-in-first-out queue). The destination register numbers are stored, in decoded form, in a FIFO until the associated result is returned from the execution or memory pipeline to the register bank.

The organization of the lock FIFO is illustrated in Figure 14.4. Each stage of the FIFO holds a '1' in the position corresponding to the destination register. In this figure the FIFO controls 16 registers (in AMULET1 the FIFO has a column for each physical register, including the ARM banked registers) and is shown in a state where the first result to arrive will be written into r0, the second into r2 and the third into r12; the fourth FIFO stage is empty.

Figure 14.4 AMULET register lock FIFO organization.

380	The AMULET Asynchronous ARM Processors

If a subsequent instruction requests r12 as a source operand, an inspection of the FIFO column corresponding to r12 (outlined in the figure) reveals whether or not r12 is valid. A '1' anywhere in the column signifies a pending write to r12, so its current value is obsolete. The read waits until the '1' is cleared, then it can proceed.

The 'inspection' is implemented in hardware by a logical 'OR' function across the column for each register. This may appear hazardous, since the data in the FIFO may move down the FIFO while the 'OR' output is being used. However, data moves in an asynchronous FIFO by being duplicated from one stage into the next; only then is it removed from the first stage. A propagating '1' will therefore appear alternately in one or two positions and will never disappear completely, so the 'OR' output is stable even though the data is moving.
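The locking scheme can be sketched in software. The following is a toy model only, not the AMULET1 circuit: the class name `LockFifo`, its method names and the default depth of four stages are assumptions made for illustration. It shows the one-hot destination masks, the column 'OR' used to test a source register, and why the copy-then-clear movement keeps that 'OR' stable.

```python
from collections import deque


class LockFifo:
    """Toy model of a register lock FIFO; each entry is a one-hot destination mask."""

    def __init__(self, num_regs=16, depth=4):
        self.num_regs = num_regs
        self.depth = depth
        self.entries = deque()  # oldest pending write first

    def lock(self, dest):
        """Record a pending write to register `dest` when an instruction issues."""
        if len(self.entries) >= self.depth:
            raise RuntimeError("lock FIFO full: instruction issue must wait")
        self.entries.append(1 << dest)

    def is_locked(self, reg):
        """OR down the column for `reg`: any '1' means a write is still pending."""
        column = 0
        for mask in self.entries:
            column |= (mask >> reg) & 1
        return bool(column)

    def write_back(self):
        """The oldest result arrives: clear its lock and return its destination."""
        mask = self.entries.popleft()
        return mask.bit_length() - 1


def or_while_moving(column, src):
    """Yield the column OR as a '1' moves from stage `src` to `src + 1`."""
    column = list(column)
    yield any(column)        # before the move
    column[src + 1] = 1      # duplicate into the next stage first
    yield any(column)        # the '1' is briefly in two positions
    column[src] = 0          # only then remove it from the first stage
    yield any(column)        # after the move


# The state described for Figure 14.4: pending writes to r0, r2 and r12.
fifo = LockFifo()
for dest in (0, 2, 12):
    fifo.lock(dest)
```

A read of r12 would wait while `fifo.is_locked(12)` is true and proceed once three calls to `write_back()` have drained the pending results.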

AMULET1 depends entirely on the register locking mechanism to maintain register coherency, and as a result the execution pipeline is stalled quite frequently in typical code. (Register dependencies between consecutive instructions are common in typical code since the compiler makes no attempt to avoid them because standard ARM processors are insensitive to such dependencies.)

AMULET1 performance

AMULET1 was developed to demonstrate the feasibility of designing a fully asynchronous implementation of a commercial microprocessor architecture. The prototype chips were functional and ran test programs generated using standard ARM development tools. The performance of the prototypes is summarized in Table 14.1, which shows the characteristics of the devices manufactured on two different process technologies by European Silicon Structures (ES2) and GEC Plessey Semiconductors (GPS). The layout of the 1 µm AMULET1 core is shown in Figure 14.5. The performance figures, based on the Dhrystone benchmark, are of the same order as, but certainly no better than, those of an ARM6 processor built on the same process technology. However, AMULET1 was built primarily to demonstrate the feasibility of self-timed design, which it manifestly does.

Table 14.1 AMULET1 characteristics.

	AMULET1/ES2	AMULET1/GPS	ARM6
Process	1 µm	0.7 µm	1 µm
Area (mm²)	5.5 x 4.1	3.9 x 2.9	4.1 x 2.7
Transistors	58,374	58,374	33,494
Performance	20.5 kDhrystones	~40 kDhrystones (a)	31 kDhrystones
Multiplier	5.3 ns/bit	3 ns/bit	25 ns/bit
Conditions	5 V, 20°C	5 V, 20°C	5 V, 20 MHz
Power	152 mW	N/A (b)	148 mW
MIPS/W	77	N/A	120

a. Estimated maximum performance.
b. The GPS part does not support power measurement.

Figure 14.5 AMULET1 die plot.

14.3 AMULET2

AMULET2 is the second-generation asynchronous ARM processor. It employs an organization which is very similar to that used in AMULET1, as illustrated in Figure 14.3 on page 378. As described earlier, the two-phase (transition) signalling used on AMULET1 was abandoned in favour of four-phase (level) signalling. In addition, a number of organizational features were added to enhance performance.
AMULET2 register forwarding

AMULET2 employs the same register-locking mechanism as AMULET1, but in order to reduce the performance loss due to register dependency stalls, it also incorporates forwarding mechanisms to handle common cases. The bypass mechanisms used in clocked processor pipelines are inapplicable to asynchronous pipelines, so novel techniques are required. The two techniques used on AMULET2 are:

• A last result register. The instruction decoder keeps a record of the destination of the result from the execution pipeline, and if the immediately following instruction uses this register as a source operand the register read phase is bypassed and the value is collected from the last result register.
• A last loaded data register. The instruction decoder keeps a record of the destination of the last data item loaded from memory, and whenever this register is used as a source operand the register read phase is bypassed and the value is picked up directly from the last loaded data register. A mechanism similar to the lock FIFO serves as a guard on the register to ensure that the correct value is collected.

Both these mechanisms rely on the required result being available; where there is some uncertainty (for example when the result is produced by an instruction which is conditionally executed) the instruction decoder can fall back on the locking mechanism, exploiting the ability of the asynchronous organization to cope with variable delays in the supply of the operands.
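The decision the decoder makes for each source operand can be sketched as follows. This is a simplified illustration covering only the last result register; the function name, the tuple layout of `last_result` and the `certain` flag (false when the producing instruction is conditionally executed) are hypothetical, and the real hardware waits for values rather than returning a 'stall' marker.

```python
def read_operand(reg, regfile, last_result, locked):
    """Choose where a source operand comes from.

    regfile:     dict mapping register number -> current value
    last_result: (dest, value, certain) for the immediately preceding
                 instruction, or None if there is no such record
    locked:      set of registers with a pending write in the lock FIFO
    Returns (value, source); value is None when the read must wait.
    """
    if last_result is not None:
        dest, value, certain = last_result
        if dest == reg and certain:
            # Bypass the register read phase entirely.
            return value, "last-result register"
    if reg in locked:
        # Fall back on register locking: wait for the write to complete.
        return None, "stall"
    return regfile[reg], "register bank"
```

The fallback path is what makes the scheme safe: whenever forwarding cannot be trusted, the operand read simply waits on the lock, as it always did on AMULET1.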
AMULET2 jump trace buffer

AMULET1 prefetches instructions sequentially from the current PC value and all deviations from sequential execution must be issued as corrections from the execution pipeline to the address interface. Every time the PC has to be corrected performance is lost and energy is wasted in prefetching instructions that are then discarded.

AMULET2 attempts to reduce this inefficiency by remembering where branches were previously taken and guessing that control will subsequently follow the same path. The organization of the jump trace buffer is shown in Figure 14.6; it is similar to that used on the MU5 mainframe computer developed at the University of Manchester between 1969 and 1974 (which also operated with asynchronous control).

Figure 14.6 The AMULET2 jump trace buffer.

The buffer caches the program counters and targets of recently taken branch instructions, and whenever it spots an instruction fetch from an address that it has stored it modifies the predicted control flow from sequential to the previous branch target. If this prediction turns out to be correct, exactly the right instruction sequence is fetched from memory; if it turns out to be wrong and the branch should not have been taken, the branch is executed as an 'unbranch' instruction to return to the previous sequential flow.
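The behaviour described above can be modelled in a few lines. This is a minimal sketch under stated assumptions: the class and method names are invented for illustration, instructions are taken to be 4 bytes, and the oldest-first replacement policy is an assumption, since the text does not say how entries are replaced.

```python
from collections import OrderedDict


class JumpTraceBuffer:
    """Toy model: remembers the targets of recently taken branches."""

    def __init__(self, entries=20):
        self.entries = entries
        self.table = OrderedDict()  # branch address -> last taken target

    def predict_next(self, pc):
        """Next fetch address: the stored target on a hit, else sequential."""
        return self.table.get(pc, pc + 4)

    def record_taken(self, pc, target):
        """Remember a taken branch, evicting the oldest entry when full."""
        self.table.pop(pc, None)
        self.table[pc] = target
        if len(self.table) > self.entries:
            self.table.popitem(last=False)
```

A branch seen for the first time is fetched past sequentially; once `record_taken` has stored it, the next fetch from the same address is redirected to the remembered target.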

Branch statistics

The effectiveness of the jump trace buffer depends on the statistics of typical branch behaviour. Typical figures are shown in Table 14.2.

Table 14.2 AMULET2 branch prediction statistics.

Prediction algorithm	Correct	Incorrect	Redundant fetches
Sequential	33%	67%	2 per branch (ave.)
Trace buffer	67%	33%	1 per branch (ave.)

In the absence of the jump trace buffer, the default sequential fetch pattern is equivalent to predicting that all branches are not taken. This is correct for one-third of all branches, and incorrect for the remaining two-thirds. A jump trace buffer with around 20 entries reverses these proportions, correctly predicting around two-thirds of all branches and mispredicting or failing to predict around one-third.

Although the depth of prefetching beyond a branch is non-deterministic, a prefetch depth of around three instructions is observed on AMULET2. The fetched instructions are used when the branch is predicted correctly, but are discarded when the branch is mispredicted or not predicted. The jump trace buffer therefore reduces the average number of redundant fetches per branch from two to one. Since branches occur around once every five instructions in typical code, the jump trace buffer may be expected to reduce the instruction fetch bandwidth by around 20% and the total memory bandwidth by 10% to 15%.
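The arithmetic behind the instruction fetch saving can be checked directly. The sketch below uses only the figures quoted above (one branch per five instructions, redundant fetches per branch falling from two to one); the function name is hypothetical. It reports the saving both per executed instruction and as a share of all fetches made without the buffer, since which base the 'around 20%' refers to is an assumption.

```python
def redundant_fetch_saving(branch_interval=5, redundant_before=2, redundant_after=1):
    """Return (fetches saved per executed instruction, share of all fetches saved)."""
    saved_per_instruction = (redundant_before - redundant_after) / branch_interval
    # Without the buffer, each executed instruction costs one useful fetch
    # plus its share of the redundant fetches that follow each branch.
    fetches_before = 1 + redundant_before / branch_interval
    return saved_per_instruction, saved_per_instruction / fetches_before


per_instruction, share_of_total = redundant_fetch_saving()
```

With the default figures, one fetch is saved for every five instructions executed (0.2 per instruction), which is one in seven of the fetches made without the buffer.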

Where the system performance is limited by the available memory bandwidth, this saving translates directly into improved performance; in any case it represents a power saving due to the elimination of redundant activity.

'Halt'

Unlike some other microprocessors, the ARM does not have an explicit 'Halt' instruction. Instead, when a program can find no more useful work to do it usually enters an idle loop, executing:

	B	.	; loop until interrupted

Here the '.' denotes the current PC, so the branch target is the branch instruction itself, and the program sits in this single instruction loop until an interrupt causes it to do something else.

Clearly the processor is doing no useful work while in this idle loop, so any power it uses is wasted. AMULET2 detects the opcode corresponding to a branch which loops to itself and uses this to stall a signal at one point in the asynchronous control network. This stall rapidly propagates throughout the control, bringing the processor to an inactive, zero power state. An active interrupt request releases the stall, enabling the processor to resume normal throughput immediately.
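What 'a branch which loops to itself' looks like at the bit level can be shown with the standard ARM branch encoding, in which the target is the branch address plus 8 plus four times a signed 24-bit offset. The function below is an illustrative decoder, not a description of the AMULET2 detection logic; its name is hypothetical and it ignores the condition field.

```python
def is_branch_to_self(instruction):
    """True if a 32-bit ARM word is a B (not BL) whose target is its own address."""
    is_branch = ((instruction >> 25) & 0b111) == 0b101   # bits 27..25 select B/BL
    is_link = (instruction >> 24) & 1                    # bit 24 set means BL
    offset = instruction & 0x00FFFFFF
    if offset & 0x800000:                                # sign-extend the 24-bit offset
        offset -= 1 << 24
    # Target = address + 8 + 4 * offset, so an offset of -2 loops to itself.
    return bool(is_branch and not is_link and offset == -2)
```

The unconditional form of the idle loop assembles to the word 0xEAFFFFFE, which this check accepts.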

AMULET2 can therefore switch between zero power and maximum throughput states at a very high rate and with no software overhead; indeed, much existing ARM code will give optimum power-efficiency using this scheme even though it was not written with the scheme in mind. This makes the processor very applicable to low-power applications with bursty real-time load characteristics.

14.4 AMULET2e

AMULET2e is an AMULET2 processor core (see Section 14.3 on page 381) combined with 4 Kbytes of memory, which can be configured either as a cache or a fixed RAM area, and a flexible memory interface (the funnel) which allows 8-, 16- or 32-bit external devices to be connected directly, including memories built from DRAM. The internal organization of AMULET2e is illustrated in Figure 14.7.
AMULET2e cache

The cache comprises four 1 Kbyte blocks, each of which is a fully associative random replacement store with a quad-word line and block size. A pipeline register between the CAM and the RAM sections allows a following access to begin its CAM look-up while the previous access completes within the RAM; this exploits the ability of the AMULET2 core to issue multiple memory requests before the data is returned from the first. Sequential accesses are detected and bypass the CAM look-up, thereby saving power and improving performance.

Figure 14.7 AMULET2e internal organization.

Cache line fetches are non-blocking, accessing the addressed item first and then allowing the processor to continue while the rest of the line is fetched. The line fetch automaton continues loading the line fetch buffer while the processor accesses the cache. There is an additional CAM entry that identifies references to the data which is stored in the line fetch buffer. Indeed, this data remains in the line fetch buffer, where it can be accessed on equal terms to data in the cache, until the next cache miss, whereupon the whole buffer is copied into the cache while the new data is loaded from external memory into the line fetch buffer.
AMULET2e silicon

A plot of the AMULET2e die is shown in Figure 14.8 and the characteristics of the device are summarized in Table 14.3 on page 386. The AMULET2 core uses 93,000 transistors, the cache 328,000 and the remainder are in the control logic and pads.

Figure 14.8 AMULET2e die plot.

Timing reference

The absence of a reference clock in an asynchronous system makes timing memory accesses an issue that requires careful consideration. The solution incorporated into AMULET2e uses a single external reference delay connected directly to the chip and configuration registers, loaded at start-up, which specify the organization and timing properties of each memory region. The reference delay could, for example, reflect the external SRAM access time, so the RAM will be configured to take one reference delay. The slower ROM may be configured to take several reference delays. (Note that the reference delay is only used for off-chip timing; all on-chip delays are self-timed.)

Table 14.3 AMULET2e characteristics.

Process	0.5 µm	Transistors	454,000	MIPS	40
Metal layers	3	Die area	41 mm²	Power	140 mW
Vdd	3.3 V	Clock	none	MIPS/W	285
AMULET2e systems

AMULET2e has been configured to make building small systems as straightforward as possible. As an example, Figure 14.9 shows a test card incorporating AMULET2e. The only components, apart from AMULET2e itself, are four SRAM chips, one ROM chip, a UART and an RS232 line interface. The UART uses a crystal oscillator to control its bit rate, but all the system timing functions are controlled by AMULET2e.

The ROM contains the standard ARM 'Angel' code and the host computer at the other end of the RS232 serial line runs the ARM development tools. This system demonstrates that using an asynchronous processor need be no more difficult than using a clocked processor provided that the memory interface has been carefully thought out.

Figure 14.9 AMULET2e test card organization.

14.5 AMULET3

AMULET3 is being developed to establish the commercial viability of asynchronous design. Like its predecessors, AMULET3 is a full-functionality ARM-compatible microprocessor with support for interrupts and memory faults. AMULET1 and AMULET2 implemented the ARM6 architecture (ARM architecture version 3G). AMULET3 supports ARM architecture version 4T, including the 16-bit Thumb instruction set.
Performance objective

The objective of the AMULET3 project was to produce an asynchronous implementation of ARM architecture v4T which is competitive in terms of power-efficiency and performance with the ARM9TDMI. This implies a performance target of over 100 MIPS (measured using Dhrystone 2.1) on a 0.35 µm process, compared to the 40 MIPS delivered by AMULET2e on a 0.5 µm process.

Increasing the performance by more than a factor of two requires a radical change to the core organization. As with a clocked processor, the basic approach is based on a combination of increasing the cycle rate of the processor pipeline and decreasing the average number of cycles per instruction. Here, however, the 'cycles' are not defined by an external clock and are not of fixed duration.
AMULET3 core organization

The organization of AMULET3 is illustrated in Figure 14.10 on page 388. The six principal pipeline stages are as follows:

• the instruction prefetch unit, which includes a branch target buffer;
• the instruction decode, register read and forwarding stage;
• the execute stage, which includes the shifter, multiplier and ALU;
• the data memory interface;
• the reorder buffer;
• the register result write-back stage.

The core employs a Harvard architecture (separate instruction and data memory ports) which provides the higher memory bandwidth required to support higher performance. Only those instructions that require data memory access (loads and stores) pass through the data memory, so the pipeline depth is greater for these instructions than, for example, for simple data processing instructions.

Further details on each of the pipeline stages are given below.
Instruction prefetch unit

The prefetch unit operates autonomously, fetching 32-bit (one ARM or two Thumb) instruction packets from memory whenever it has room to store them.

Figure 14.10 AMULET3 core organization.

It incorporates a branch prediction unit that is similar to the jump trace buffer used in AMULET2, but has extensions to support branch prediction in Thumb code. A two-instruction Thumb packet may contain zero, one or two branch instructions, and the branch predictor must handle all of these cases. When executing Thumb code the 16-entry branch prediction unit works in two 8-entry halves, with branches at even half-word addresses being stored in one half and branches at odd half-word addresses in the other. A packet with a branch in each half word may 'hit' in both halves, whereupon the target of the first instruction (at the even address) takes priority.
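The priority rule for a Thumb packet can be sketched as a simple look-up. This is an illustration only: the function name is hypothetical, the two halves are modelled as plain dictionaries without the 8-entry limit, and the packet address is assumed to be word-aligned so that the first instruction sits at the even half-word.

```python
def thumb_packet_prediction(packet_addr, even_half, odd_half):
    """Predict the next fetch address for a packet of two Thumb instructions.

    even_half / odd_half: dicts mapping branch address -> predicted target,
    standing in for the two halves of the branch prediction unit.
    """
    first_addr = packet_addr        # instruction at the even half-word
    second_addr = packet_addr + 2   # instruction at the odd half-word
    if first_addr in even_half:     # the first instruction wins when both hit
        return even_half[first_addr]
    if second_addr in odd_half:
        return odd_half[second_addr]
    return packet_addr + 4          # no hit: fetch the next packet sequentially
```

Giving the even half priority reflects program order: if the first instruction in the packet branches, the second is never reached.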

AMULET3 includes a zero-power 'Halt' instruction, as did AMULET2. Here the halt takes effect in the prefetch unit rather than the execution unit, giving a faster resumption of full throughput when an interrupt occurs. Interrupts themselves are also handled here, again giving reduced latency.

Decode and register read

The decode stage includes Thumb and ARM decode logic, and the register read and forwarding mechanisms.

The ARM7TDMI implemented the Thumb decode function by first converting Thumb instructions into their ARM equivalents, and then performing ARM decode. ARM9TDMI reduced decode latency by decoding Thumb instructions directly. AMULET3 adopts a middle way, decoding certain time-critical control signals directly and others via the ARM instruction decoder. The asynchronous pipeline
