Furber S., ARM system-on-chip architecture, 2000

The ARM710T, ARM720T and ARM740T


Figure 12.2 The ARM710T cache organization.

four possible locations will be overwritten by new data on a cache miss. The cache uses a write-through strategy as the target clock rate is only a few times higher than the rate at which standard off-chip memory devices can cycle.

The organization of the ARM710T cache is illustrated in Figure 12.2. Bits [10:4] of the virtual address are used to index into each of the four tag stores. The tags contain bits [31:11] of the virtual addresses of the corresponding data, so these tags are compared with bits [31:11] of the current virtual address. If one of the tags matches, the cache has hit and the corresponding line can be accessed from the data RAM using the same index (bits [10:4] of the virtual address) together with two bits which encode the number of the tag store which produced the matching tag. Virtual address bits [3:2] select the word from the line and, if a byte or half-word access is requested, bits [1:0] select the byte or half-word from the word.

It is interesting to compare the ARM710T cache organization with that of the ARM3 and ARM610 (described in detail in Section 10.4 on page 279) since they illustrate the sorts of issues that arise in designing a cache for both good performance and low-power operation. Although there is, as yet, no final word on the correct way to design a cache for this sort of application, the designer can be guided by the following observations on the examples presented in these ARM chips.


ARM CPU Cores

Cache speed

High-associativity caches give the best hit rate, but require sequential CAM then RAM accesses which limits how fast the cycle time can become. Caches with a lower associativity can perform parallel tag and data accesses to give faster cycle times, and although a direct mapped cache has a significantly lower hit rate than a fully associative one, most of the associativity benefits accrue going from direct-mapped to 2- or 4-way associative; beyond 4-way the benefits of increased associativity are small. However, a fully associative CAM-RAM cache is much simpler than a 4-way associative RAM-RAM cache.
Cache power

CAM is somewhat power-hungry, requiring a parallel comparison with every entry on each cycle. Segmenting the cache by reducing the associativity a little and activating only a subsection of the CAM reduces the power cost significantly for a small increase in complexity.

In a static RAM the main users of power are the analogue sense-amplifiers. A 4-way cache must activate four times as many sense-amplifiers in the tag store as a direct-mapped cache; if it exploits the speed advantage offered by parallel tag and data accesses, it will also use four times as many sense-amplifiers in the data store. (RAM-RAM caches can, alternatively, perform serial tag and data accesses to save power, only activating a particular data RAM when a hit is detected in the corresponding tag store.) Waste power can be minimized by using self-timed power-down circuits to turn off the sense-amplifiers as soon as the data is valid, but the power used in the sense-amplifiers is still significant.
Sequential accesses

Where the processor is accessing memory locations which fall within the same cache line it should be possible to bypass the tag look-up for all but the first access. The ARM generates a signal which indicates when the next memory access will be sequential to the current one, and this can be used, with the current address, to deduce that the access will fall in the same line. (This does not catch every access within the same line, but it catches nearly all of them and is very simple to implement.)

Where an access will be in the same line, bypassing the tag look-up increases the access speed and saves power. Potentially, sequential accesses could use slower sense-amplifiers (possibly using standard logic rather than analogue circuits) and save considerable power, though care must be taken to ensure that an increased voltage swing on the bit lines does not cost more energy than is saved in the sense-amplifiers.
Power optimization

The cache designer must remember that the goal is to minimize the overall system power, not just the cache power. Off-chip accesses cost a lot more energy than on-chip accesses, so the first priority must be to find a cache organization which gives a good hit rate. Deciding between a highly associative CAM-RAM organization or a set-associative RAM-RAM organization requires a detailed investigation of all of the design issues, and may be strongly influenced by low-level circuit issues such as novel ways to build power-efficient sense-amplifiers or CAM hit detectors.


Exploiting sequential accesses to save power and to increase performance is always a good idea. Typical dynamic execution statistics from the ARM show that 75% of all accesses are sequential, and since sequential accesses are fundamentally easier to deal with, this seems to be a statistic that should not be overlooked. Where power-efficiency is paramount, it may be worth sacrificing performance by making nonsequential accesses take two clock cycles; this will only reduce performance by about 25% and may reduce the cache power requirements by a factor of two or three.

An interesting question for research into low power is to ask whether the best cache organization for power-efficiency is necessarily the same as the best organization for high performance.

ARM710T MMU

The ARM710T memory management unit implements the ARM memory management architecture described in Section 11.6 on page 302 using the system control coprocessor described in Section 11.5 on page 298.

The translation look-aside buffer (TLB) is a 64-entry associative cache of recently used translations which accelerates the translation process by removing the need for the 2-stage table look-up in a high proportion of accesses.

ARM710T write buffer

The write buffer holds four addresses and eight data words. The memory management unit defines which addresses are bufferable. Each address may be associated with any number of the data words, so the write buffer may hold one word (or byte) of data to write to one address and seven words to write to another address, or two blocks of four words to write to different addresses, and so on. The data words associated with a particular address are written to sequential memory locations starting at that address.

(Clearly multiple data words associated with one address are generated principally by store multiple register instructions, the only other potential source being a data transfer from an external coprocessor.)

The mapping is illustrated in Figure 12.3 on page 322, which shows the first address mapping six data words and the second and third addresses mapping one data word each. The fourth address is currently unused. The write buffer becomes full either when all four addresses are used or when all eight data words are full.

The processor can write into the buffer at the fast (cache) clock speed and continue executing instructions from the cache while the write buffer stores the data to memory at the memory clock rate. The processor is therefore fully decoupled from the memory speed so long as the instructions and data it needs are in the cache and the write buffer is not full when it comes to perform a write operation. The write buffer gives a performance benefit of around 15% for a modest hardware cost.

The principal drawback of the write buffer is that it is not possible to recover from external memory faults caused by buffered writes since the processor state is not recoverable. The processor can still support virtual memory since translation faults are detected in the on-chip MMU, so the exception is raised before the data gets to the

write buffer. But, for example, a memory error correction scheme based on software error recovery cannot be supported if buffered writes are enabled.

Figure 12.3 Write buffer mapping example.

ARM710T silicon

The characteristics of an ARM710T implemented on a 0.35 µm CMOS process are summarized in Table 12.1.

 

 

 

 

Table 12.1 ARM710T characteristics.

Process        0.35 µm    Transistors   N/A         MIPS     53
Metal layers   3          Core area     11.7 mm²    Power    240 mW
Vdd            3.3 V      Clock         0–59 MHz    MIPS/W   220

ARM720T

The ARM720T is very similar to the ARM710T with the following extensions:

• Virtual addresses in the bottom 32 Mbytes of the address space can be relocated to the 32 Mbyte memory area specified in the ProcessID register (CP15 register 13).

• The exception vectors can be moved from the bottom of memory to 0xffff0000, thereby preventing them from being translated by the above mechanism. This function is controlled by the 'V' bit in CP15 register 1.

These extensions are intended to enhance the ability of the CPU core to support the Windows CE operating system. They are implemented in the CP15 MMU control registers described in Section 11.5 on page 298.

ARM740T

The ARM740T differs from the ARM710T only in having a simpler memory protection unit in place of the 710T's memory management unit. The memory protection unit implements the architecture described in Section 11.4 on page 297 using the system control coprocessor described in Section 11.3 on page 294.

The protection unit does not support virtual to physical memory address translation, but provides basic protection and cache control functionality in a lower cost form. This is of benefit to embedded applications that run fixed software systems where the overhead of full address translation cannot be justified. There are also performance and power-efficiency advantages in omitting address translation hardware (where it is not needed), since a TLB miss results in several external memory accesses.

The organization of the ARM740T is illustrated in Figure 12.4 on page 324.

ARM740T silicon

The characteristics of an ARM740T implemented on a 0.35 µm CMOS process are summarized in Table 12.2.

As can be seen from Table 12.2, the general characteristics of the ARM740T are the same as those of the ARM710T and 720T as listed in Table 12.1 on page 322. The only significant differences are the saving of just under 2 mm² of core area due to the omission of the MMU, and the reduced power consumption.

 

Table 12.2 ARM740T characteristics.

Process        0.35 µm    Transistors   N/A        MIPS     53
Metal layers   3          Core area     9.8 mm²    Power    175 mW
Vdd            3.3 V      Clock         0–59 MHz   MIPS/W   300

12.2 The ARM810

The ARM810 is a high-performance ARM CPU chip with an on-chip cache and memory management unit. It was the first implementation of the ARM instruction set developed by ARM Limited to use a fundamentally different pipeline structure from that used on the original ARM chip designed at Acorn Computers and carried through to ARM6 and ARM7. The ARM810 has now been superseded by the ARM9 series.

The ARM8 core is the integer processing unit used in the ARM810. It was described in Section 9.2 on page 256. The ARM810 adds the following on-chip support to the basic CPU:

• An 8 Kbyte virtually addressed unified instruction and data cache using a copy-back (or write-through, controlled by the page table entry) write strategy and offering a double-bandwidth capability as required by the ARM8 core. The cache is 64-way associative, and constructed from 1 Kbyte components to simplify the future development of a smaller cache for an embedded system chip or a larger cache on a more advanced process technology. It is designed so that areas of the cache can be 'locked down' to ensure that speed-critical sections of code, which arise in many embedded applications, do not get flushed.

Figure 12.4 ARM740T organization.

• A memory management unit conforming to the ARM MMU architecture described in Section 11.6 on page 302 using the system control coprocessor described in Section 11.5 on page 298.

• A write buffer to allow the processor to continue while the write to external memory completes.

The organization of the ARM810 is illustrated in Figure 12.5 on page 325.

Double-bandwidth cache

The core's double-bandwidth requirement is satisfied by the cache; external memory accesses use conventional line refill and individual data transfer protocols. Double-bandwidth is available from the cache only for sequential memory accesses, so it is exploited by the prefetch unit for instruction fetches and by the core for load multiple register instructions.

Since the pipeline organization used on ARM7TDMI uses the memory interface almost every cycle, some way must be found to increase the available memory bandwidth if the CPI (the number of clocks per instruction) of the processor is to be improved. The StrongARM (described in the next section) achieves an increased bandwidth by incorporating separate instruction and data caches, thereby making double the bandwidth potentially available (though not all of this bandwidth can be used since the ARM generates instruction traffic with around twice the bandwidth of the data traffic, so the effective increase in bandwidth is approximately 50%). The ARM810 delivers an increased bandwidth by returning two sequential words per clock cycle which, since typically around 75% of an ARM's memory accesses are sequential, increases the usable bandwidth by about 60%. Although the ARM810 approach gives more bandwidth, it also creates more opportunity for conflict between instruction and data accesses; evaluating the relative merits of the two approaches is not straightforward.

As the cache uses a copy-back write strategy and is virtually addressed, evicting a dirty line requires an address translation. The mechanism used on the ARM810 is to send the virtual tag to the MMU for translation.


Figure 12.6 ARM810 die photograph.

ARM810 silicon

A photograph of an ARM810 die is shown in Figure 12.6. The ARM8 core datapath is visible at the top of the photograph with its control logic immediately below. The MMU TLB is in the top right-hand corner and the eight 1 Kbyte cache blocks occupy the bottom area of the chip.

The characteristics of the ARM810 are summarized in Table 12.3.

Table 12.3 ARM810 characteristics.

Process        0.5 µm     Transistors   836,022    MIPS     86
Metal layers   3          Die area      76 mm²     Power    500 mW
Vdd            3.3 V      Clock         0–72 MHz   MIPS/W   172


12.3 The StrongARM SA-110

The StrongARM CPU was developed by Digital Equipment Corporation in collaboration with ARM Limited. It was the first ARM processor to use a modified-Harvard (separate instruction and data cache) architecture.

The SA-110 is now manufactured by Intel Corporation following their take-over of Digital Semiconductor in 1998.

Digital Alpha background

Digital were, perhaps, best known in the microprocessor business for their range of 'Alpha' microprocessors which are 64-bit RISC processors that operate at very high clock rates. The ability to sustain these clock frequencies is a result of advanced CMOS process technology, carefully balanced pipeline design, a very thoroughly engineered clocking scheme and in-house design tools that give unusually good control of all these factors.

The same approach has been applied to the design of StrongARM, with the added objective of achieving exceptional power-efficiency.

StrongARM organization

The organization of StrongARM is shown in Figure 12.7 on page 328. Its main features are:

• A 5-stage pipeline with register forwarding.
• Single-cycle execution of all common instructions except 64-bit multiplies, multiple register transfers and the swap memory and register instruction.
• A 16 Kbyte 32-way associative instruction cache with 32-byte line.
• A 16 Kbyte 32-way associative copy-back data cache with 32-byte line.
• Separate 32-entry instruction and data translation look-aside buffers.
• An 8-entry write buffer with up to 16 bytes per entry.
• Pseudo-static operation with low power consumption.

The processor uses system control coprocessor 15 to manage the on-chip MMU and cache resources, and incorporates JTAG boundary scan test circuitry for testing printed circuit board connectivity (the JTAG 'in-test', for in-circuit testing of the device itself, is not supported).

The first StrongARM chips are implemented on Digital's 0.35 µm CMOS process, using three layers of metal. They use around 2.5 million transistors on a 50 mm² die (which is very small for a processor of this performance) and deliver 200 to 250 Dhrystone MIPS with a 160 to 200 MHz clock, dissipating half to just under one watt using a 1.65 V to 2 V supply.


Figure 12.7 StrongARM organization.

The StrongARM processor core

The processor core employs a 'classic' 5-stage pipeline with full bypassing (register forwarding) and hardware interlocks. The ARM instruction set requires some instruction decoding to take place before the register bank read accesses can begin, and it also requires a shift operation in series with the ALU, but both of these addi-