Furber S. ARM System-on-Chip Architecture. 2000.

The StrongARM SA-110

tional logic functions are fitted within their respective pipeline stages and do not add to the pipeline depth. The pipeline stages are:

1. Instruction fetch (from the instruction cache).

2. Instruction decode and register read; branch target calculation and execution.

3. Shift and ALU operation, including data transfer memory address calculation.

4. Data cache access.

5. Result write-back to register file.
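As a rough illustration of the timing these five stages imply (a sketch of ideal pipeline behaviour, not StrongARM's actual control logic): with one instruction entering the pipeline per cycle, instruction i (counting from 0) completes write-back in cycle i + 5, so n back-to-back instructions finish in n + 4 cycles.

```python
# Sketch of ideal 5-stage pipeline timing with no stalls.
# Stage names follow the list above; function names are ours.
STAGES = ["fetch", "decode", "execute", "cache", "writeback"]

def completion_cycle(i: int) -> int:
    """1-based cycle in which instruction i (0-based) leaves write-back."""
    return i + len(STAGES)

def total_cycles(n: int) -> int:
    """Cycles to drain n back-to-back instructions through the pipeline."""
    return 0 if n == 0 else completion_cycle(n - 1)
```

For example, ten instructions complete in 14 cycles: ten issue cycles plus four more to drain the last instruction through the remaining stages.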

The organization of the major pipeline components is illustrated in Figure 12.8 on page 330. The shaded bars delimit the pipeline stages, and data which passes across these bars is latched at the crossing point. Data which feeds around the end of a bar is being passed back or forward across pipeline stages, for example:

- The register forwarding paths, which pass intermediate results on to following instructions to avoid a register interlock stall caused by a read-after-write hazard.

- The PC path, which forwards pc + 4 from the fetch stage of the next instruction, giving pc + 8 for the current instruction, to be used as r15 and in the branch target calculation.

Pipeline features

Of particular note in this pipeline structure are:

- The need for three register read ports to enable register-controlled shifts and store with base plus index addressing to issue in one cycle.

- The need for two register write ports to allow load with auto-index to issue in one cycle.

- The address incrementer in the execute stage to support load and store multiple instructions.

- The large number of sources for the next PC value.

This last point reflects the many ways the ARM architecture allows the PC to be modified as a result of its visibility as r15 in the register bank.

PC modification

The most common source of the next PC is the PC incrementer in the fetch stage; this value is available at the start of the next cycle, allowing one instruction to be fetched each cycle.

The next most common source of the PC is the result of a branch instruction. The dedicated branch displacement adder computes the target address during the instruction decode stage, causing a taken branch to incur a one-cycle penalty in addition to the cycle taken to execute the branch itself. Note that the displacement addition can take place in parallel with the instruction decode since the offset is a fixed field within the instruction; if the instruction turns out not to be a branch the computed target is simply discarded.

ARM CPU Cores

Figure 12.8 StrongARM core pipeline organization.

 

Figure 12.9 StrongARM loop test pipeline behaviour.

The operation of the pipeline during this sequence is shown in Figure 12.9. Note how the condition codes become valid just in time to avoid increasing the branch penalty; the instruction immediately following the branch is fetched and discarded, but then the processor begins fetching from the branch target. The target address is generated concurrently with the condition codes that determine whether or not it should be used.

The same one-cycle penalty applies to branch and link instructions, which take the branch in the same way but use the execute and write stages to compute pc + 4 and write it to r14, the link register. It also applies to the normal subroutine return instruction, MOV pc, lr, where the target address comes from the register file rather than from the branch displacement adder, but it is still available at the end of the decode stage.

If the return address must be computed, for example when returning from an exception, there is a two-cycle penalty since the ALU result is only available at the end of the execute stage, and where the PC is loaded from memory (from a jump table, or a subroutine return from the stack) there is a three-cycle penalty.
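The next-PC penalties described above can be summarized in a small table; the dictionary keys below are descriptive labels of ours, not architectural terms.

```python
# Extra cycles beyond the instruction's own execution, by next-PC source,
# as described in the text above.
PC_SOURCE_PENALTY = {
    "pc_incrementer": 0,   # sequential fetch: next PC ready every cycle
    "branch_adder": 1,     # target ready at end of decode (B, BL, MOV pc, lr)
    "alu_result": 2,       # computed target, e.g. an exception return
    "memory_load": 3,      # PC loaded from a jump table or the stack
}
```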

Forwarding paths

The execution pipeline includes three forwarding paths to each register operand to avoid stalls when read-after-write hazards occur. Values are forwarded from:

1. the ALU result;
2. data loaded from the data cache;
3. the buffered ALU result.

These paths remove all data-dependency stalls except when a loaded data value is used by the following instruction, in which case a single-cycle stall is required.
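The interlock rule above can be sketched as a small issue check: with full forwarding, the only hazard that costs a cycle is a load whose result is consumed by the very next instruction (function and parameter names are ours).

```python
def issue_stall(prev_dest: str, prev_is_load: bool, srcs: set) -> int:
    """Stall cycles before issuing the current instruction.
    Forwarding covers every read-after-write hazard except the load-use
    case, which costs one cycle (as described in the text above)."""
    if prev_is_load and prev_dest in srcs:
        return 1   # loaded value arrives too late to forward without a bubble
    return 0       # ALU results (direct or buffered) are forwarded for free
```

For example, LDR r0 followed by an ADD that reads r0 stalls one cycle, while the same pair with an ALU-producing first instruction issues back to back.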

Abort recovery

It might seem that one of these paths could be avoided if the ALU result were written into the register file in the following stage rather than buffering it and delaying the write to the last stage. The merit of the delayed scheme is that data aborts can occur during the data cache access, perhaps requiring remedial action to recover the base register value (for instance when the base register is overwritten in a load multiple sequence before the fault is generated). This scheme not only allows recovery but supports the cleanest recovery mechanism, leaving the base register value as it was at the start of the instruction and removing the need for the exception handler to undo any auto-indexing.

Multiplier

A feature of particular note is the StrongARM's multiplication unit which, despite the processor's high clock rate, computes at the rate of 12 bits per cycle, giving the product of two 32-bit operands in one to three clock cycles. The high-speed multiplier gives StrongARM considerable potential in applications which require significant digital signal processing performance.
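A 12-bits-per-cycle multiplier with early termination needs ceil(significant bits / 12) iterations, i.e. one to three cycles for a 32-bit operand. A sketch under that assumption (the SA-110's exact termination rule is not given in the text):

```python
def mul_iterations(multiplier: int) -> int:
    """Iterations for a multiplier retiring 12 bits per cycle with early
    termination. Assumes the hardware scans the unsigned 32-bit multiplier
    operand; the real SA-110 rule may differ in detail."""
    bits = max(1, (multiplier & 0xFFFFFFFF).bit_length())
    return -(-bits // 12)   # ceiling division: 1..3 for 32-bit operands
```

A small multiplier such as 100 finishes in one iteration, while a full 32-bit operand needs three.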

Instruction cache

The I-cache holds 16 Kbytes of instructions in 512 lines of eight instructions (32 bytes). The cache is 32-way associative (using a CAM-RAM organization) with a cyclic replacement algorithm and uses the processor's virtual address. The block size is the same as the line size, so whole lines are loaded from memory at a time. Individual areas of memory may be marked as cacheable or uncacheable using the memory management tables, and the cache may be disabled and flushed (in its entirety) under software control.
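Under the figures quoted above (16 Kbytes, 32-byte lines, 512 lines, 32-way associative) the cache has 512/32 = 16 sets, so a virtual address splits into a 5-bit line offset, a 4-bit set index and a tag. The field layout below is our inference from those numbers, not a statement from the text.

```python
LINE_BYTES, LINES, WAYS = 32, 512, 32
SETS = LINES // WAYS                  # 16 sets of 32 ways each
OFFSET_BITS, INDEX_BITS = 5, 4        # log2(32) and log2(16)

def split_vaddr(va: int):
    """Decompose a virtual address into (tag, set index, byte offset)."""
    offset = va & (LINE_BYTES - 1)
    index = (va >> OFFSET_BITS) & (SETS - 1)
    tag = va >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset
```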

Data cache

The data cache uses a similar organization to the instruction cache but with added functions to cope with data stores (instructions are read-only). It has a capacity of 16 Kbytes with 512 32-byte lines arranged as a 32-way virtually addressed associative cache with cyclic replacement. The block size is also 32 bytes. The cache may be disabled by software and individual regions of memory may be made uncacheable. (Making I/O regions uncacheable is usually a good idea.)

The data cache uses a copy-back write strategy, and has two dirty bits per line so that when a line is evicted from the cache all, half or none of it is written to memory. The use of two dirty bits rather than one reduces memory traffic since the 'half dirty' case is quite common. The cache stores a copy of the physical address with each line for use when the line is written back to memory, and evicted lines are placed into the write buffer.
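With one dirty bit per 16-byte half-line, the writeback traffic for an evicted 32-byte line is all, half or none of the line:

```python
HALF_LINE_BYTES = 16   # one dirty bit covers each half of a 32-byte line

def eviction_traffic(dirty_low: bool, dirty_high: bool) -> int:
    """Bytes written back to memory when a line is evicted."""
    return HALF_LINE_BYTES * (int(dirty_low) + int(dirty_high))
```

The common 'half dirty' case costs 16 bytes of memory traffic where a single per-line dirty bit would force the full 32.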

Since the cache uses a copy-back write strategy, it is sometimes necessary to cause all dirty lines to be written back to memory. On StrongARM this is achieved using software to load new data into every line, causing dirty lines to be evicted.

Synonyms

As with any virtually addressed cache, care must be taken to ensure that all cacheable physical memory locations have unique virtual addresses that map to them. When two different virtual addresses map to the same physical location the virtual addresses are synonyms; where synonyms exist, neither virtual address should be cacheable.

Cache consistency

The separate instruction and data caches create the potential for the two caches to have inconsistent copies of the same physical memory location. Whenever a region of memory is treated as (writeable) data at one time and as executable instructions at another, great care must be taken to avoid inconsistencies.

A common example of this situation is when a program is loaded (or copied from one memory location to another) and then executed. During the load phase the program is treated as data and passes through the data cache. When it is executed it is loaded into the instruction cache (which may have copies of previous instructions from the same addresses). To ensure correct operation:

1. The load phase should be completed.

2. The entire data cache should be 'cleaned' (by loading new data into every line as described above) or, where the addresses of the affected cache lines are known, these lines may be explicitly cleaned and flushed.

3. The instruction cache should be flushed (to remove obsolete instructions).

Alternative solutions might involve making certain regions of memory uncacheable during the load phase.
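The 'load new data into every line' cleaning technique can be sketched as follows. The flush_base parameter names a hypothetical, otherwise-unused cacheable region at least as large as the cache; reading one word per line across it forces every line to be replaced, evicting (and hence writing back) any dirty lines.

```python
CACHE_BYTES, LINE_BYTES = 16 * 1024, 32

def clean_read_addresses(flush_base: int) -> list:
    """One read address per cache line of a hypothetical flush region.
    Reading them all replaces every line in the cache, so dirty lines
    are written back to memory on the way out (sketch of the technique
    described above, assuming cyclic replacement cycles through all lines)."""
    return [flush_base + i * LINE_BYTES
            for i in range(CACHE_BYTES // LINE_BYTES)]
```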

Note that this is not a problem for literals (data items included in the instruction stream) which are quite common in ARM code. Although a block of memory may be loaded into both caches since it contains a mixture of instructions and literals, so long as individual words (or bytes) are treated consistently as either instructions or data, there will be no problem. It is even acceptable for the program to change the value of a literal (though this is rarely used and is probably bad practice) so long as it does not affect the values of instructions which may be in the instruction cache. It is better practice, however, to avoid literals altogether and to keep data in data areas that are separate from code areas which contain instructions.

Compiler issues

The separation of the instruction and data caches should be observed by the compiler, which should pool constants across compilation units rather than placing them at the end of each routine. This minimizes the pollution of the data cache with instructions and of the instruction cache with data.

The write buffer

The write buffer smooths out small peaks in the write data bandwidth, decoupling the processor from stalls caused by the memory bus saturating. A large peak that causes the buffer to fill will stall the processor. Independent writes to the same 16-byte area are merged within the write buffer, though only to the last address written into the buffer. The buffer stores up to eight addresses (each address aligned to a 16-byte boundary), copying the virtual address for use when merging writes and the physical address to address the external memory, and up to 16 bytes of data for each address (so each address can handle a dirty half-line or up to four registers from a single store multiple).
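The merge rule described above — only into the most recent entry, and only within the same 16-byte aligned area — can be sketched as:

```python
MERGE_SPAN = 16   # each buffer entry covers a 16-byte aligned area

def can_merge(last_entry_addr, new_addr: int) -> bool:
    """True if a new write may merge into the most recent buffer entry,
    i.e. both fall in the same 16-byte aligned area (sketch of the rule
    in the text; names are ours)."""
    if last_entry_addr is None:        # empty buffer: nothing to merge with
        return False
    return (last_entry_addr // MERGE_SPAN) == (new_addr // MERGE_SPAN)
```

For example, a write to 0x100C merges with a previous write to 0x1004, but a write to 0x1010 allocates a new entry even if an older entry already covers that area.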

The write buffer may be disabled by software, and individual memory regions may be marked as bufferable or unbufferable using the MMU page tables. All cacheable regions are bufferable (evicted cache lines are written through the write buffer), but uncacheable regions may be bufferable or unbufferable. Normally the I/O region is unbufferable. An unbuffered write will wait for the write buffer to empty before it is written to memory.

Data reads that miss the data cache are checked against entries in the write buffer to ensure consistency, but instruction reads are not checked against the write buffer. Whenever memory locations which have been used as data are to be used as instructions, a special instruction should be used to ensure that the write buffer has drained.

MMU organization

StrongARM incorporates the standard ARM memory management architecture, using separate translation look-aside buffers (TLBs) for instructions and data. Each TLB has 32 translation entries arranged as a fully associative cache with cyclic replacement. A TLB miss invokes table-walking hardware to fetch the translation and access permission information from main memory.
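A 32-entry fully associative TLB with cyclic replacement behaves like a FIFO of translations. A minimal sketch (class and method names are ours, not from the text):

```python
from collections import deque

class CyclicTLB:
    """Fully associative TLB with cyclic (oldest-first) replacement."""

    def __init__(self, entries: int = 32):
        self.capacity = entries
        self.slots = deque()          # (virtual_page, physical_page), oldest first

    def lookup(self, vpage):
        for v, p in self.slots:       # fully associative: compare every entry
            if v == vpage:
                return p              # hit
        return None                   # miss: hardware would walk the tables

    def insert(self, vpage, ppage):
        if len(self.slots) == self.capacity:
            self.slots.popleft()      # cyclic replacement evicts the oldest entry
        self.slots.append((vpage, ppage))
```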

StrongARM silicon

A photograph of a StrongARM die is shown in Figure 12.10 with an overlay indicating the major functional areas. The die area is, not surprisingly, dominated by the instruction cache (ICACHE) and the data cache (DCACHE). Each cache has its own MMU (IMMU and DMMU). The processor core has the instruction issue unit (IBOX) and the execution unit (EBOX) with high-speed multiplication hardware (MUL). The write buffer and external bus controller complete the processor logic.

Figure 12.10 StrongARM die photograph.


The high-frequency on-chip clock is generated by a phase-locked loop (PLL) from an external 3.68 MHz clock input.

The characteristics of the StrongARM are summarized in Table 12.4.

Table 12.4 StrongARM characteristics.

12.4 The ARM920T and ARM940T

 

The ARM920T and ARM940T are based upon the ARM9TDMI processor core (see Section 9.3 on page 260), to which instruction and data caches have been added. The instruction and data ports are merged via an AMBA bus master unit, and a write buffer and memory management unit (ARM920T) or memory protection unit (ARM940T) are also incorporated.

ARM920T

The overall organization of the ARM920T is illustrated in Figure 12.11 on page 336.

ARM920T caches

The instruction and data caches are both 16 Kbytes in size and are built using a segmented CAM-RAM organization to give 64-way associativity. Each cache comprises eight segments of 64 lines, the required segment being addressed by A[7:5]. They have 8-word (32-byte) lines and support lock-down in units of 256 bytes (corresponding to one line in each segment). The replacement strategy is pseudo-random or round-robin, and is determined by the 'RR' bit (bit 14) in CP15 register 1. A complete 8-word line is reloaded on a cache miss.
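Since the required segment is addressed by A[7:5], the segment an address falls in is simply bits 7:5 of the address:

```python
def segment_index(addr: int) -> int:
    """ARM920T cache segment selected by address bits A[7:5] (eight segments)."""
    return (addr >> 5) & 0x7
```

Consecutive 32-byte lines therefore rotate through the eight segments: addresses 0x00, 0x20, ..., 0xE0 land in segments 0 through 7, and 0x100 wraps back to segment 0.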

 

The instruction cache is read-only. The data cache supports a copy-back write strategy, and each line has one valid bit, two dirty bits and a write-back bit. The write-back bit duplicates information usually found in the translation system, and enables the cache to implement a write operation as write-through or copy-back without reference to the MMU. When a cache line is replaced any dirty data is sent to the write buffer, and this may amount to zero, four or eight words depending on the two dirty bits. The data cache allocates space only on a read miss, not on a write miss.

As the data cache is accessed using virtual addresses, there is a problem whenever dirty data has to be written back to main memory since this action requires the physical address. The ARM810 solves this by passing the virtual address back to the MMU for translation, but of course there is no guarantee that the necessary translation entry is still in the TLB. The whole process can therefore be quite time consuming. The ARM920T avoids this problem by having a second tag store that is used to hold the physical address for each line in the cache. A cache line flush does not then need to involve the MMU and can always proceed without delay to transfer the dirty data to the write buffer.

Figure 12.11 ARM920T organization.

The ARM920T can force dirty cache lines to be flushed back to main memory (a process known as 'cleaning') using either the cache index or the memory address. It is therefore possible to clean all of the entries corresponding to a particular memory area.

ARM920T write buffer

The write buffer will hold up to four addresses and 16 data words.

ARM920T MMU

The MMU implements the memory management architecture described in Section 11.6 on page 302 and is controlled by the system control coprocessor, CP15, as described in Section 11.5 on page 298. As the ARM920T has separate instruction and data memory ports it has two MMUs, one for each of the ports.

The memory management hardware includes a 64-entry TLB for the instruction memory port and a 64-entry TLB for the data port. The ARM920T includes the ProcessID logic that is required to support Windows CE. The caches and MMUs come after the ProcessID insertion, so a context switch doesn't invalidate either the caches or the TLBs. In addition to supporting the 64 Kbyte large pages and the 4 Kbyte small pages, the ARM920T MMU also supports 1 Kbyte 'tiny' page translation.

The ARM920T MMU supports selective lock-down for TLB entries, ensuring that translation entries that are critical, for example to a real-time process, cannot be ejected.

ARM920T silicon

The characteristics of the ARM920T are summarized in Table 12.5.

Table 12.5 ARM920T characteristics.

Process: 0.25 µm
Metal layers: 4
Vdd: 2.5 V
Transistors: 2,500,000
Core area: 23-25 mm²
Clock: 0-200 MHz
MIPS: 220
Power: 560 mW
MIPS/W: 390

ARM940T

The ARM940T is another CPU core based on the ARM9TDMI integer core. It is simpler than the ARM920T as it does not support virtual to physical address translation. The organization of the ARM940T is shown in Figure 12.12 on page 338.

Memory protection unit

The memory protection unit implements the architecture described in Section 11.4 on page 297. As the ARM940T has separate instruction and data memory ports it has two protection units, one for each of the ports.

This configuration has no virtual to physical address translation mechanism. Many embedded systems do not require address translation, and as a full MMU requires significant silicon area, its omission when it is not required represents a significant cost saving. The AMBA interface and absence of address translation hardware both indicate the expectation that the application will be in embedded systems with other AMBA macrocells on the same chip.

ARM940T caches

Both instruction and data caches have a size of 4 Kbytes and comprise four 1 Kbyte segments, each of which uses a fully associative CAM-RAM structure. (For an explanation of these cache terms see Section 10.3 on page 272.) They have a quad-word line structure, and always load a complete line on a cache miss if the address is cacheable as indicated by the memory protection unit.
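The geometry quoted above works out as follows (a quad-word line is 4 words, i.e. 16 bytes), giving 64 fully associative lines per 1 Kbyte segment:

```python
CACHE_BYTES = 4 * 1024               # 4 Kbyte cache
SEGMENTS = 4                         # four 1 Kbyte segments
LINE_BYTES = 16                      # quad-word line: 4 words x 4 bytes each

# Fully associative lines (ways) within each segment.
LINES_PER_SEGMENT = CACHE_BYTES // SEGMENTS // LINE_BYTES
```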

Figure 12.12 ARM940T organization.

The caches support lock-down, which means that sections of the cache can be loaded and then the contents protected from being flushed. Locked-down sections of the cache are clearly unavailable for further general-purpose cache use, but the ability to guarantee that certain critical code sections will always be found in the cache may be more important than getting the best cache hit rate. The caches may be locked down in 16-word units.

The ARM9TDMI instruction port is used to load instructions and therefore only performs read operations. The only coherency problem arises when the program that the cache holds copies of is modified in main memory. Any modification of code (such as loading a new program into memory previously occupied by a program that is no longer required) must be handled carefully, and the instruction cache selectively or completely flushed before the new program is executed.

The ARM9TDMI data port supports writes as well as reads, and therefore the data cache must implement some sort of write strategy (see 'Write strategies' on page 278).