Furber S. ARM system-on-chip architecture. 2000.pdf
![](/html/616/253/html_XmXTL4PZdq.q1mw/htmlconvd-ESWtSu331x1.jpg)
The ARM710T, ARM720T and ARM740T
Figure 12.2 The ARM710T cache organization.
four possible locations will be overwritten by new data on a cache miss. The cache uses a write-through strategy as the target clock rate is only a few times higher than the rate at which standard off-chip memory devices can cycle.
The organization of the ARM710T cache is illustrated in Figure 12.2. Bits [10:4] of the virtual address are used to index into each of the four tag stores. The tags contain bits [31:11] of the virtual addresses of the corresponding data, so these tags are compared with bits [31:11] of the current virtual address. If one of the tags matches, the cache has hit and the corresponding line can be accessed from the data RAM using the same index (bits [10:4] of the virtual address) together with two bits which encode the number of the tag store which produced the matching tag. Virtual address bits [3:2] select the word from the line and, if a byte or half-word access is requested, bits [1:0] select the byte or half-word from the word.
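The address decomposition described above can be sketched in a few lines of Python. This is only an illustrative model, not the hardware's actual implementation: the names `CacheAddress`, `split_address` and `lookup` are hypothetical, and the tag stores are modelled as four plain lists of 128 entries (with `None` standing for an invalid entry, an assumption made for the sketch).

```python
from dataclasses import dataclass

WAYS = 4             # four tag stores
SETS = 128           # 7 index bits, virtual address bits [10:4]
WORDS_PER_LINE = 4   # 2 word-select bits, virtual address bits [3:2]


@dataclass(frozen=True)
class CacheAddress:
    tag: int    # virtual address bits [31:11]
    index: int  # virtual address bits [10:4]
    word: int   # virtual address bits [3:2]
    byte: int   # virtual address bits [1:0]


def split_address(va: int) -> CacheAddress:
    """Split a 32-bit virtual address into the fields used by the cache."""
    va &= 0xFFFFFFFF
    return CacheAddress(
        tag=va >> 11,
        index=(va >> 4) & 0x7F,
        word=(va >> 2) & 0x3,
        byte=va & 0x3,
    )


def lookup(tag_stores, va: int):
    """Return (way, index, word) on a hit, or None on a miss.

    tag_stores is a list of four lists, each holding 128 tags.
    The hardware compares all four tags in parallel; the loop here
    only models the result of that comparison.
    """
    a = split_address(va)
    for way, store in enumerate(tag_stores):
        if store[a.index] == a.tag:
            return way, a.index, a.word
    return None


# Capacity implied by the field widths: 4 x 128 lines of 16 bytes.
CAPACITY_BYTES = WAYS * SETS * WORDS_PER_LINE * 4
```

With these field widths the four tag stores cover 4 × 128 lines of 16 bytes each, which is 8 Kbytes in total. Two addresses that differ only in bits [31:11] map to the same index and therefore compete for the same four locations.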
It is interesting to compare the ARM710T cache organization with that of the ARM3 and ARM610 (described in detail in Section 10.4 on page 279) since they illustrate the sorts of issues that arise in designing a cache for both good performance and low-power operation. Although there is, as yet, no final word on the correct way to design a cache for this sort of application, the designer can be guided by the following observations on the examples presented in these ARM chips.
![](/html/616/253/html_XmXTL4PZdq.q1mw/htmlconvd-ESWtSu334x1.jpg)
is of benefit to embedded applications that run fixed software systems where the overhead of full address translation cannot be justified. There are also performance and power-efficiency advantages in omitting address translation hardware (where it is not needed), since a TLB miss results in several external memory accesses.
The organization of the ARM740T is illustrated in Figure 12.4 on page 324.
ARM740T silicon
The characteristics of an ARM740T implemented on a 0.35 µm CMOS process are summarized in Table 12.2.
As can be seen from Table 12.2, the general characteristics of the ARM740T are the same as those of the ARM710T and 720T as listed in Table 12.1 on page 322. The only significant differences are the saving of just under 2 mm² of core area due to the omission of the MMU, and the reduced power consumption.
Table 12.2 ARM740T characteristics.
Process | 0.35 µm | Transistors | N/A | MIPS | 53 |
Metal layers | 3 | Core area | 9.8 mm² | Power | 175 mW |
Vdd | 3.3 V | Clock | 0-59 MHz | MIPS/W | 300 |
12.2 The ARM810
The ARM810 is a high-performance ARM CPU chip with an on-chip cache and memory management unit. It was the first implementation of the ARM instruction set developed by ARM Limited to use a fundamentally different pipeline structure from that used on the original ARM chip designed at Acorn Computers and carried through to ARM6 and ARM7. The ARM810 has now been superseded by the ARM9 series.
ARM CPU Cores
![](/html/616/253/html_XmXTL4PZdq.q1mw/htmlconvd-ESWtSu336x1.jpg)
Figure 12.4 ARM740T organization.
The ARM8 core is the integer processing unit used in the ARM810. It was described in Section 9.2 on page 256. The ARM810 adds the following on-chip support to the basic CPU:
• An 8 Kbyte virtually addressed unified instruction and data cache using a copy-back (or write-through, controlled by the page table entry) write strategy and offering a double-bandwidth capability as required by the ARM8 core. The cache is 64-way associative, and constructed from 1 Kbyte components to simplify the future development of a smaller cache for an embedded system chip or a larger cache on a more advanced process technology. It is designed so that areas of the cache can be 'locked down' to ensure that speed-critical sections of code, which arise in many embedded applications, do not get flushed.
• A memory management unit conforming to the ARM MMU architecture described in Section 11.6 on page 302 using the system control coprocessor described in Section 11.5 on page 298.
• A write buffer to allow the processor to continue while the write to external memory completes.
The organization of the ARM810 is illustrated in Figure 12.5 on page 325.
![](/html/616/253/html_XmXTL4PZdq.q1mw/htmlconvd-ESWtSu337x1.jpg)
Double-bandwidth cache
The core's double-bandwidth requirement is satisfied by the cache; external memory accesses use conventional line refill and individual data transfer protocols. Double-bandwidth is available from the cache only for sequential memory accesses, so it is exploited by the prefetch unit for instruction fetches and by the core for load multiple register instructions.
Since the pipeline organization used on ARM7TDMI uses the memory interface almost every cycle, some way must be found to increase the available memory bandwidth if the CPI (the number of clocks per instruction) of the processor is to be improved. The StrongARM (described in the next section) achieves an increased bandwidth by incorporating separate instruction and data caches, thereby making double the bandwidth potentially available. Not all of this bandwidth can be used, however: the ARM generates instruction traffic with around twice the bandwidth of the data traffic, so the effective increase in bandwidth is approximately 50%. The ARM810 delivers an increased bandwidth by returning two sequential words per clock cycle which, since typically around 75% of an ARM's memory accesses are sequential, increases the usable bandwidth by about 60%. Although the ARM810 approach gives more bandwidth, it also creates more opportunity for conflict between instruction and data accesses; evaluating the relative merits of the two approaches is not straightforward.
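The 50% and 60% figures quoted above can be reproduced with a very simple timing model. The sketch below is an assumption about how such estimates might be derived, not a description of how the book's numbers were actually obtained: it counts cache cycles only, assumes the two split caches work fully in parallel, and assumes every sequential access pairs up perfectly with its neighbour. The function names are hypothetical.

```python
def harvard_speedup(instr_to_data_ratio: float = 2.0) -> float:
    """Bandwidth gain from separate instruction and data caches.

    A unified single-ported cache serves instruction and data traffic
    one after the other; split caches serve them in parallel, so the
    time taken is set by the busier of the two ports.
    """
    instr, data = instr_to_data_ratio, 1.0
    unified_time = instr + data
    split_time = max(instr, data)
    return unified_time / split_time


def double_bandwidth_speedup(sequential_fraction: float = 0.75) -> float:
    """Bandwidth gain from returning two sequential words per cycle.

    Non-sequential accesses still cost one cycle each; sequential
    accesses are assumed to pair up, costing half a cycle each.
    """
    single_time = 1.0
    double_time = (1.0 - sequential_fraction) + sequential_fraction / 2.0
    return single_time / double_time


if __name__ == "__main__":
    print(f"split caches:     +{harvard_speedup() - 1:.0%}")
    print(f"double bandwidth: +{double_bandwidth_speedup() - 1:.0%}")
```

With a 2:1 instruction-to-data ratio the split-cache model gives 3/2 = 1.5, and with 75% sequential accesses the double-bandwidth model gives 1/0.625 = 1.6. Neither model accounts for the conflicts between instruction and data accesses mentioned in the text, which is one reason the comparison is not straightforward.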
As the cache uses a copy-back write strategy and is virtually addressed, evicting a dirty line requires an address translation. The mechanism used on the ARM810 is to send the virtual tag to the MMU for translation.
![](/html/616/253/html_XmXTL4PZdq.q1mw/htmlconvd-ESWtSu338x1.jpg)
Figure 12.6 ARM810 die photograph.
ARM810 silicon
A photograph of an ARM810 die is shown in Figure 12.6. The ARM8 core datapath is visible at the top of the photograph with its control logic immediately below. The MMU TLB is in the top right-hand corner and the eight 1 Kbyte cache blocks occupy the bottom area of the chip.
The characteristics of the ARM810 are summarized in Table 12.3.
Table 12.3 ARM810 characteristics.
Process | 0.5 µm | Transistors | 836,022 | MIPS | 86 |
Metal layers | 3 | Die area | 76 mm² | Power | 500 mW |
Vdd | 3.3 V | Clock | 0-72 MHz | MIPS/W | 172 |
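The MIPS/W entries in Tables 12.2 and 12.3 follow directly from the MIPS and power entries, as the short check below illustrates. The helper name `mips_per_watt` is hypothetical; the input values are taken from the two tables.

```python
def mips_per_watt(mips: float, power_mw: float) -> float:
    """Power-efficiency in MIPS/W from MIPS and power in milliwatts."""
    return mips / (power_mw / 1000.0)


# (MIPS, power in mW) as listed in Tables 12.2 and 12.3
chips = {
    "ARM740T": (53, 175),
    "ARM810": (86, 500),
}

if __name__ == "__main__":
    for name, (mips, power_mw) in chips.items():
        print(f"{name}: {mips_per_watt(mips, power_mw):.0f} MIPS/W")
```

The ARM810 value comes out at exactly 172 MIPS/W. The ARM740T value is about 303 MIPS/W, which the table appears to round to 300.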
12.3 The StrongARM SA-110
The StrongARM CPU was developed by Digital Equipment Corporation in collaboration with ARM Limited. It was the first ARM processor to use a modified-Harvard (separate instruction and data cache) architecture.
The SA-110 is now manufactured by Intel Corporation following their take-over of Digital Semiconductor in 1998.
Digital Alpha background
Digital were, perhaps, best known in the microprocessor business for their range of 'Alpha' microprocessors which are 64-bit RISC processors that operate at very high clock rates. The ability to sustain these clock frequencies is a result of advanced CMOS process technology, carefully balanced pipeline design, a very thoroughly engineered clocking scheme and in-house design tools that give unusually good control of all these factors.
The same approach has been applied to the design of StrongARM, with the added objective of achieving exceptional power-efficiency.
StrongARM organization
The organization of StrongARM is shown in Figure 12.7 on page 328. Its main features are:
• A 5-stage pipeline with register forwarding.
• Single-cycle execution of all common instructions except 64-bit multiplies, multiple register transfers and the swap memory and register instruction.
• A 16 Kbyte 32-way associative instruction cache with 32-byte line.
• A 16 Kbyte 32-way associative copy-back data cache with 32-byte line.
• Separate 32-entry instruction and data translation look-aside buffers.
• An 8-entry write buffer with up to 16 bytes per entry.
• Pseudo-static operation with low power consumption.
The processor uses system control coprocessor 15 to manage the on-chip MMU and cache resources, and incorporates JTAG boundary scan test circuitry for testing printed circuit board connectivity (the JTAG 'in-test', for in-circuit testing of the device itself, is not supported).
The first StrongARM chips are implemented on Digital's 0.35 µm CMOS process, using three layers of metal. They use around 2.5 million transistors on a 50 mm² die (which is very small for a processor of this performance) and deliver 200 to 250 Dhrystone MIPS with a 160 to 200 MHz clock, dissipating half to just under one watt using a 1.65 V to 2 V supply.
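The cache parameters in the feature list above fix the number of sets and the address field widths. The sketch below works these out under the usual set-associative assumptions (power-of-two sizes, 32-bit addresses, index bits taken from just above the line offset); the source does not state which address bits StrongARM actually uses, and the function name `cache_geometry` is hypothetical.

```python
def cache_geometry(size_bytes: int, ways: int, line_bytes: int,
                   address_bits: int = 32) -> dict:
    """Derive sets and field widths for a set-associative cache.

    Assumes size, associativity and line length are powers of two.
    """
    lines = size_bytes // line_bytes
    sets = lines // ways
    offset_bits = (line_bytes - 1).bit_length()
    index_bits = (sets - 1).bit_length()
    return {
        "lines": lines,
        "sets": sets,
        "offset_bits": offset_bits,
        "index_bits": index_bits,
        "tag_bits": address_bits - index_bits - offset_bits,
    }


if __name__ == "__main__":
    # StrongARM instruction or data cache: 16 Kbytes, 32-way, 32-byte lines
    print("StrongARM cache:", cache_geometry(16 * 1024, 32, 32))
    # Write buffer capacity: 8 entries of up to 16 bytes each
    print("write buffer bytes:", 8 * 16)
```

For each StrongARM cache this gives 512 lines arranged as 16 sets of 32, with a 5-bit line offset and a 4-bit set index. The same function applied to a 4-way cache of 8 Kbytes with 16-byte lines gives 128 sets and a 21-bit tag, which matches the ARM710T's use of bits [10:4] as the index and bits [31:11] as the tag.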
![](/html/616/253/html_XmXTL4PZdq.q1mw/htmlconvd-ESWtSu340x1.jpg)
Figure 12.7 StrongARM organization.
The StrongARM processor core
The processor core employs a 'classic' 5-stage pipeline with full bypassing (register forwarding) and hardware interlocks. The ARM instruction set requires some instruction decoding to take place before the register bank read accesses can begin,
and it also requires a shift operation in series with the ALU, but both of these addi-