
Table 10.1 Summary of cache organizational options.

Organizational feature    Options
Cache-MMU relationship    Physical cache; Virtual cache
Cache contents            Unified instruction and data cache; Separate instruction and data caches
Associativity             Direct-mapped (RAM-RAM); Set-associative (RAM-RAM); Fully associative (CAM-RAM)
Replacement strategy      Round-robin; Random; LRU
Write strategy            Write-through; Write-through with write buffer; Copy-back

10.4 Cache design - an example

 

The choice of organization for a cache requires the consideration of several factors as discussed in Section 10.3 on page 272, including the size of the cache, the degree of associativity, the line and block sizes, the replacement algorithm, and the write strategy. Detailed architectural simulations are required to analyse the effects of these choices on the performance of the cache.

The ARM3 cache

The ARM3, designed in 1989, was the first ARM chip to incorporate an on-chip cache, and detailed studies were carried out into the effects of these parameters on performance and bus use. These studies used specially designed hardware to capture address traces while running several benchmark programs on an ARM2; software was then used to analyse these traces to model the behaviour of the various organizations. (Today special hardware is generally unnecessary since desktop machines have sufficient performance to simulate large enough programs without hardware support.)

 

The study started by setting an upper bound on the performance benefit that could be expected from a cache. A 'perfect' cache, which always contains the requested data, was modelled to set this bound. Any real cache is bound to miss some of the time, so it cannot perform any better than one which always hits.

Three forms of perfect cache were modelled using realistic assumptions about the cache and external memory speeds (which were 20 MHz and 8 MHz respectively): caches which hold either just instructions, mixed instructions and data, or just data. The results are shown in Table 10.2 on page 280, normalized to the performance of a system with no cache. They show that instructions are the most important values to hold in the cache, but including data values as well can give a further 25% performance increase.


Table 10.2 'Perfect' cache performance.

Cache form                    Performance
No cache                      1
Instruction-only cache        1.95
Instruction and data cache    2.5
Data-only cache               1.13

Although a decision was taken early on that the cache write strategy would be write-through (principally on the grounds of simplicity), it is still possible for the cache to detect a write miss and load a line of data from the write address. This 'allocate on write miss' strategy was investigated briefly, but proved to offer a negligible benefit in exchange for a significant increase in complexity, so it was rapidly abandoned. The problem was reduced to finding the best organization, consistent with chip area and power constraints, for a unified instruction and data cache with allocation on a read miss.

Various different cache organizations and sizes were investigated, with the results shown in Figure 10.6. The simplest cache organization is the direct-mapped cache, but even with a size of 16 Kbytes, the cache is significantly worse than the 'perfect' case. The next step up in complexity is the dual-set associative cache; now the performance of the 16 Kbyte cache is within a per cent or so of the perfect cache. But at the time of the design of the ARM3 (1989) a 16 Kbyte cache required a large chip area, and the 4 Kbyte cache does not perform so well. (The results depend strongly on the program used to generate the address traces, but these are typical.)

Figure 10.6 Unified cache performance as a function of size and organization.

Going to the other extreme, a fully associative cache performs significantly better at the smaller size, delivering the 'perfect' performance on the benchmark program used for the tests. Here the replacement algorithm is random; LRU (least recently used) gives very similar results.

The cache model was then modified to use a quad-word line, which is necessary to reduce the area cost of the tag store. This change had minimal effect on the performance.

The fully associative cache requires a large CAM (Content Addressable Memory) tag store which is likely to consume significant power, even with a quad-word line. The power can be reduced considerably by segmenting the CAM into smaller components, but this reduces the associativity. An analysis of the sensitivity of the system performance to the degree of associativity, using a 4 Kbyte cache, is shown in Figure 10.7. This shows the performance of the system for all associativities from fully (256-way) associative down to direct-mapped (1-way). Although the biggest performance increase is in going from direct-mapped to dual-set associative, there are noticeable improvements all the way up to 64-way associativity.

It would therefore appear that a 64-way associative CAM-RAM cache provides the same performance as the fully associative cache while allowing the 256 CAM entries to be split into four sections to save power. The external memory bandwidth requirement of each level of associativity is also shown in Figure 10.7 (relative to an uncached processor), and note how the highest performance corresponds to the lowest external bandwidth requirement. Since each external access costs a lot of energy compared with internal activity, the cache is simultaneously increasing performance and reducing system power requirements.

Figure 10.7 The effect of associativity on performance and bandwidth requirement.

The organization of the cache is therefore that shown in Figure 10.8. The bottom two bits of the virtual address select a byte within a 32-bit word, the next two bits select a word within a cache line and the next two bits select one of the four 64-entry CAM tag stores. The rest of the virtual address is presented to the selected tag store (the other tag stores are disabled to save power) to check whether the data is in the cache, and the result is either a miss, or a hit together with the address of the data in the cache data RAM.

Figure 10.8 ARM3 cache organization.
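As a concrete illustration of this address split, the short C fragment below decodes a 32-bit virtual address into the fields just described (byte, word, CAM section and tag). It is only a sketch of the decoding, not ARM3 hardware; the type, field and function names are invented for the example.

    #include <stdint.h>
    #include <stdio.h>

    /* Illustrative decomposition of a 32-bit virtual address for a cache with
     * 4-word (16-byte) lines and four 64-entry CAM tag sections, as described
     * in the text. Names are invented for this sketch. */
    typedef struct {
        uint32_t byte_in_word;   /* bits [1:0]  - byte within a 32-bit word   */
        uint32_t word_in_line;   /* bits [3:2]  - word within a 4-word line   */
        uint32_t cam_section;    /* bits [5:4]  - which of the 4 CAM sections */
        uint32_t tag;            /* bits [31:6] - compared against CAM tags   */
    } cache_addr_t;

    static cache_addr_t decode_address(uint32_t va)
    {
        cache_addr_t a;
        a.byte_in_word = va & 0x3;
        a.word_in_line = (va >> 2) & 0x3;
        a.cam_section  = (va >> 4) & 0x3;  /* only this section's CAM is enabled */
        a.tag          = va >> 6;          /* presented to the selected tag store */
        return a;
    }

    int main(void)
    {
        cache_addr_t a = decode_address(0x00008F74);
        printf("tag=0x%x section=%u word=%u byte=%u\n",
               (unsigned)a.tag, (unsigned)a.cam_section,
               (unsigned)a.word_in_line, (unsigned)a.byte_in_word);
        return 0;
    }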

ARM600 cache control FSM

To illustrate the sort of control logic required to manage a cache, the ARM600 cache control finite state machine is described below. The ARM600 cache borrows its design from the ARM3 described in Section 10.4 on page 279 and it also includes a translation system similar to the scheme described in Section 10.5 on page 283.

The ARM600 operates with two clocks. The fast clock defines the processor cycle time when it is operating from the cache or writing to the write buffer; the memory clock defines the speed when the processor is accessing external memory. The clock supplied to the core switches dynamically between these two clock sources, which may be asynchronous with respect to each other. There is no requirement for the memory clock to be a simple subdivision of the fast clock, though if it is the processor can be configured to avoid the synchronization overhead.

Normally the processor runs from the cache using the fast clock. When a cache miss occurs (or a reference is made to uncacheable memory), the processor synchronizes to the memory clock and either performs a single external access or a cache line-fill. Because switching between the clocks incurs an overhead for the synchronization (to reduce the risk of metastability to an acceptable level), the processor checks the next address before deciding whether or not to switch back to the fast clock.

The finite state machine that controls this activity is shown in Figure 10.9 on page 284. Following initialization, the processor enters the Check tag state running from the fast clock. Depending on whether or not the addressed data is found in the cache, the processor can proceed in one of the following ways:

• So long as the address is non-sequential, does not fault in the MMU and is either a read found in the cache or a buffered write, the state machine remains in the Check tag state and a data value is returned or written every clock cycle.

• When the next address is a sequential read in the same cache line or a sequential buffered write, the state machine moves to the Sequential fast state where the data may be accessed without checking the tag and without activating the MMU. This saves power, and exploits the seq signal from the processor core. Again a data value is read or written every clock cycle.

• If the address is not in the cache or is an unbuffered write, an external access is required. This begins in the Start external state. Reads from uncacheable memory and unbuffered writes are completed as single memory transactions in the External state. Cacheable reads perform a quad-word line fetch, after fetching the necessary translation information if this was not already in the MMU.

• Cycles where the processor does not use memory are executed in the Idle state.

• At several points in the translation process it may become clear that the access cannot be completed and the Abort state is entered. Uncacheable reads and unbuffered writes may also be aborted by external hardware.

Figure 10.9 ARM600 cache control state machine.
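The behaviour listed above can be captured as a next-state function. The C sketch below is a deliberately simplified model of such a controller, not the actual ARM600 control logic: the state names follow the text and Figure 10.9, but the input signals and the transition details are assumptions made for illustration.

    /* Much-simplified model of a cache-control state machine of the kind shown
     * in Figure 10.9. State names follow the text; the inputs and transition
     * details are assumptions, not the actual ARM600 implementation. */
    typedef enum {
        CHECK_TAG,        /* fast clock; tag checked on every access          */
        SEQUENTIAL_FAST,  /* fast clock; same line, tag and MMU left idle     */
        START_EXTERNAL,   /* synchronize to the memory clock                  */
        EXTERNAL,         /* single uncached read or unbuffered write         */
        LINE_FETCH,       /* quad-word line fill on a cacheable read miss     */
        IDLE,             /* processor not using memory this cycle            */
        ABORT             /* translation fault or external abort              */
    } cache_state_t;

    typedef struct {
        int mem_request;  /* processor requests a memory access this cycle    */
        int sequential;   /* 'seq' signal: address follows the previous one   */
        int same_line;    /* sequential address stays within the current line */
        int hit;          /* requested data is resident in the cache          */
        int write;        /* access is a write                                */
        int bufferable;   /* write may be placed in the write buffer          */
        int cacheable;    /* read data may be placed in the cache             */
        int fault;        /* MMU cannot complete the translation              */
    } cache_inputs_t;

    static cache_state_t next_state(cache_state_t s, cache_inputs_t in)
    {
        if (in.fault)
            return ABORT;
        if (!in.mem_request)
            return IDLE;

        switch (s) {
        case CHECK_TAG:
        case SEQUENTIAL_FAST:
        case IDLE:
            /* A sequential read within the current line, or a sequential
             * buffered write, needs neither the tags nor the MMU. */
            if (in.sequential &&
                ((!in.write && in.same_line && in.hit) ||
                 (in.write && in.bufferable)))
                return SEQUENTIAL_FAST;
            /* Non-sequential read hits and buffered writes stay on the fast clock. */
            if ((!in.write && in.hit) || (in.write && in.bufferable))
                return CHECK_TAG;
            return START_EXTERNAL;      /* cache miss or unbuffered write */
        case START_EXTERNAL:
            return (!in.write && in.cacheable) ? LINE_FETCH : EXTERNAL;
        case EXTERNAL:
        case LINE_FETCH:
        case ABORT:
            return CHECK_TAG;           /* resume from the fast clock */
        }
        return CHECK_TAG;
    }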

10.5 Memory management

Modern computer systems typically have many programs active at the same time. A single processor can, of course, only execute instructions from one program at any instant, but by switching rapidly between the active programs they all appear to be executing at once, at least when viewed on a human timescale.


 

The rapid switching is managed by the operating system, so the application programmer can write his or her program as though it owns the whole machine. The mechanism used to support this illusion is described by the term memory management unit (MMU). There are two principal approaches to memory management, called segmentation and paging.

Segments

The simplest form of memory management allows an application to view its memory as a set of segments, where each segment contains a particular sort of information. For instance, a program may have a code segment containing all its instructions, a data segment and a stack segment. Every memory access provides a segment selector and a logical address to the MMU. Each segment has a base address and a limit associated with it. The logical address is an offset from the segment base address, and must be no greater than the limit or an access violation will occur, usually causing an exception. Segments may also have other access controls, for instance the code segment may be read-only and an attempt to write to it will also cause an exception.

The access mechanism for a segmented MMU is illustrated in Figure 10.10.

 

Figure 10.10 Segmented memory management scheme.
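A minimal sketch of the check in Figure 10.10 is given below, assuming a simple descriptor holding a base, a limit and one example permission bit; the structure layout, error-reporting convention and function name are invented for the example.

    #include <stdint.h>
    #include <stdbool.h>

    /* Sketch of the address check a segmented MMU performs (see Figure 10.10).
     * The descriptor layout and fault reporting are assumptions. */
    typedef struct {
        uint32_t base;     /* physical base address of the segment    */
        uint32_t limit;    /* maximum legal offset within the segment */
        bool     writable; /* example of an additional access control */
    } segment_t;

    /* Translate (selector, logical offset) to a physical address. Returns
     * false on an access violation, which a real MMU would report by raising
     * an exception. */
    static bool segment_translate(const segment_t *table, unsigned selector,
                                  uint32_t offset, bool is_write, uint32_t *phys)
    {
        const segment_t *seg = &table[selector];

        if (offset > seg->limit)           /* beyond the segment limit      */
            return false;
        if (is_write && !seg->writable)    /* e.g. write to a code segment  */
            return false;

        *phys = seg->base + offset;        /* offset is relative to the base */
        return true;
    }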

 

Segmentation allows a program to have its own private view of memory and to coexist transparently with other programs in the same memory space. It runs into difficulty, however, when the coexisting programs vary and the available memory is limited. Since the segments are of variable size, the free memory becomes fragmented over time and a new program may be unable to start, not because there is insufficient free memory, but because the free memory is all in small pieces none of which is big enough to hold a segment of the size required by the new program.

The crisis can be alleviated by the operating system moving segments around in memory to coalesce the free memory into one large piece, but this is inefficient, and most processors now incorporate a memory mapping scheme based on fixed-size chunks of memory called pages. Some architectures include segmentation and paging, but many, including the ARM, just support paging without segmentation.

Paging

In a paging memory management scheme both the logical and the physical address spaces are divided into fixed-size components called pages. A page is usually a few kilobytes in size, but different architectures use different page sizes. The relationship between the logical and physical pages is stored in page tables, which are held in main memory.

A simple sum shows that storing the translation in a single table requires a very large table: if a page is 4 Kbytes, 20 bits of a 32-bit address must be translated, which requires 2²⁰ × 20 bits of data in the table, or a table of at least 2.5 Mbytes. This is an unreasonable overhead to impose on a small system.

Instead, most paging systems use two or more levels of page table. For example, the top ten bits of the address can be used to identify the appropriate second-level page table in the first-level page table directory, and the second ten bits of the address then identify the page table entry which contains the physical page number. This translation scheme is illustrated in Figure 10.11.

Figure 10.11 Paging memory management scheme.
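To make the two-level walk concrete, the sketch below follows the 10/10/12 address split described above. The entry format (physical address in the upper 20 bits, flags in the lower 12), the idea of modelling physical memory as a flat word array, and the names mem, directory_base and translate are assumptions made so the example is self-contained; the real ARM formats appear in Section 11.6.

    #include <stdint.h>

    /* Model of a two-level page table walk for the 10/10/12 address split in
     * the text: 10 bits of directory index, 10 bits of page-table index and a
     * 12-bit offset within a 4 Kbyte page. */
    #define PAGE_MASK  0xFFFu     /* low 12 bits: offset within a page */
    #define INDEX_MASK 0x3FFu     /* 10-bit directory / table index    */

    /* 'mem' models physical memory as an array of 32-bit words, indexed by
     * physical byte address / 4. 'directory_base' is the physical address of
     * the first-level page directory. */
    uint32_t translate(const uint32_t *mem, uint32_t directory_base, uint32_t virt)
    {
        uint32_t dir_index   = (virt >> 22) & INDEX_MASK;  /* top 10 bits  */
        uint32_t table_index = (virt >> 12) & INDEX_MASK;  /* next 10 bits */
        uint32_t offset      =  virt        & PAGE_MASK;   /* low 12 bits  */

        /* First memory access: directory entry gives the page table's address. */
        uint32_t dir_entry   = mem[(directory_base >> 2) + dir_index];
        uint32_t table_base  = dir_entry & ~PAGE_MASK;

        /* Second memory access: page table entry gives the physical page. */
        uint32_t table_entry = mem[(table_base >> 2) + table_index];
        uint32_t page_base   = table_entry & ~PAGE_MASK;

        return page_base | offset;
    }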

Note that with the particular numbers suggested here, if 32 bits are allocated to each directory and page table entry, the directory and each page table happen to occupy exactly 4 Kbytes each, or exactly one memory page. The minimum overhead for a small system is 4 Kbytes for the page directory plus 4 Kbytes for one page table; this is sufficient to manage up to 4 Mbytes of physical memory. A fully populated 4 Gbyte (32-bit) memory would require 4 Mbytes of page tables, but this overhead is probably acceptable with this much memory to work in.

The ARM MMU, described in Section 11.6 on page 302, uses a slightly different allocation of bits from the one described here (and also supports the single-level translation of larger blocks of memory), but the principle is the same.

Virtual memory

One possibility with either memory management scheme is to allow a segment or page to be marked as absent and an exception to be generated whenever it is accessed. Then an operating system which has run out of memory to allocate can transparently move a page or a segment out of main memory into backup store, which for this purpose is usually a hard disk, and mark it as absent. The physical memory can then be allocated to a different use. If the program attempts to access an absent page or segment the exception is raised and the operating system can bring the page or segment back into main memory, then allow the program to retry the access.

 

When implemented with the paged memory management scheme, this process is known as demand-paged virtual memory. A program can be written to occupy a virtual memory space that is larger than the available physical memory space in the computer where it is run, since the operating system can wheel bits of program and data in as they are needed. Typical programs work some parts of their code very hard and rarely touch others; leaving the infrequently used routines out on disk will not noticeably affect performance. However, over-exploiting this facility causes the operating system to switch pages in and out of memory at a high rate. This is described as thrashing, and will adversely affect performance.

Restartable instructions

An important requirement in a virtual memory system is that any instruction that can cause a memory access fault must leave the processor in a state that allows the operating system to page-in the requested memory and resume the original program as though the fault had not happened. This is often achieved by making all instructions that access memory restartable. The processor must retain enough state to allow the operating system to recover enough of the register values so that, when the page is in main memory, the faulting instruction is retried with identical results to those that would have been obtained had the page been resident at the first attempt.

This requirement is usually the most difficult one to satisfy in the design of a processor while retaining high performance and minimum hardware redundancy.

Translation look-aside buffers

The paging scheme described above gives the programmer complete freedom and transparency in the use of memory, but it would seem that this has been achieved at considerable cost in performance since each memory access appears to have incurred an overhead of two additional memory accesses, one to the page directory and one to the page table, before the data itself is accessed.

This overhead is usually avoided by implementing a translation look-aside buffer (TLB), which is a cache of recently used page translations. As with instruction and data caches (described in Section 10.3 on page 272), there are organizational options relating to the degree of associativity and the replacement strategy. The line and block sizes usually equate to a single page table entry, and the size of a typical TLB is much smaller than a data cache at around 64 entries. The locality properties of typical programs enable a TLB of this size to achieve a miss rate of a per cent or so. The misses incur the table-walking overhead of two additional memory accesses.

The operation of a TLB is illustrated in Figure 10.12 on page 288.

Figure 10.12 The operation of a translation look-aside buffer.
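The essential TLB operation can be sketched in a few lines of C: a small fully associative store of recent translations is searched first, and only a miss pays for the two-access table walk. The 64-entry size follows the text; the entry layout, the round-robin replacement and the table_walk helper (assumed to be defined elsewhere) are assumptions for the example.

    #include <stdint.h>
    #include <stdbool.h>

    /* Minimal model of a translation look-aside buffer: a small, fully
     * associative cache of recent page translations. */
    #define TLB_ENTRIES 64
    #define PAGE_SHIFT  12

    typedef struct {
        uint32_t virt_page;   /* virtual page number (tag) */
        uint32_t phys_page;   /* physical page number      */
        bool     valid;
    } tlb_entry_t;

    static tlb_entry_t tlb[TLB_ENTRIES];
    static unsigned    next_victim;      /* round-robin replacement pointer */

    /* Assumed to be provided elsewhere: the two-memory-access table walk. */
    extern uint32_t table_walk(uint32_t virt_page);

    uint32_t tlb_translate(uint32_t virt)
    {
        uint32_t vpage  = virt >> PAGE_SHIFT;
        uint32_t offset = virt & ((1u << PAGE_SHIFT) - 1);

        /* Hit: hardware compares all tags in parallel; here we just scan. */
        for (unsigned i = 0; i < TLB_ENTRIES; i++)
            if (tlb[i].valid && tlb[i].virt_page == vpage)
                return (tlb[i].phys_page << PAGE_SHIFT) | offset;

        /* Miss: walk the page tables and cache the result. */
        uint32_t ppage = table_walk(vpage);
        tlb[next_victim] = (tlb_entry_t){ .virt_page = vpage,
                                          .phys_page = ppage,
                                          .valid = true };
        next_victim = (next_victim + 1) % TLB_ENTRIES;

        return (ppage << PAGE_SHIFT) | offset;
    }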

 

Virtual and physical caches

When a system incorporates both an MMU and a cache, the cache may operate either with virtual (pre-MMU) or physical (post-MMU) addresses.

A virtual cache has the advantage that the cache access may start immediately the processor produces an address, and, indeed, there is no need to activate the MMU if the data is found in the cache. The drawback is that the cache may contain synonyms, which are duplicate copies of the same main memory data item in the cache. Synonyms arise because address translation mechanisms generally allow overlapping translations. If the processor modifies the data item through one address route it is not possible for the cache to update the second copy, leading to inconsistency in the cache.

A physical cache avoids the synonym problem since physical memory addresses are associated with unique data items. However, the MMU must now be activated on every cache access, and with some MMU and cache organizations the address translation must be completed by the MMU before the cache access can begin, leading to much longer cache latencies.

A physical cache arrangement that neatly avoids the sequential access cost exploits the fact that a paging MMU only affects the high-order address bits, while the cache is accessed by the low-order address bits. Provided these sets do not overlap, the cache and MMU accesses can proceed in parallel. The physical address from the MMU arrives at the right time to be compared with the physical address tags from the cache, hiding the address translation time behind the cache tag access. This optimization is not applicable to fully associative caches, and only works if the page size used by the MMU is larger than each directly addressed portion of the cache. A 4 Kbyte page, for example, limits a direct-mapped cache to a maximum size of 4 Kbytes, a 2-way set-associative cache to a maximum size of 8 Kbytes, and so on.
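The size limit follows from a single piece of arithmetic: every bit used to index the cache must come from the untranslated page-offset bits, so each way of the cache can be at most one page in size. The tiny helper below, with an invented name, simply states that rule.

    #include <stdint.h>

    /* Largest cache that can be indexed entirely by untranslated page-offset
     * bits, allowing the cache and MMU to be accessed in parallel: each way
     * may be at most one page. Name and interface are invented for the sketch. */
    static uint32_t max_parallel_cache_size(uint32_t page_size, uint32_t ways)
    {
        return page_size * ways;   /* e.g. 4 Kbyte pages: 1-way -> 4 Kbytes,
                                      2-way -> 8 Kbytes, 4-way -> 16 Kbytes */
    }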

In practice both virtual and physical caches are in commercial use, the former relying on software conventions to contain the synonym problem and the latter either exploiting the above optimization or accepting the performance cost.