
Table 10.1 Summary of cache organizational options.

Organizational feature    Options
Cache-MMU relationship    Physical cache; Virtual cache
Cache contents            Unified instruction and data cache; Separate instruction and data caches
Associativity             Direct-mapped (RAM-RAM); Set-associative (RAM-RAM); Fully associative (CAM-RAM)
Replacement strategy      Round-robin; Random; LRU
Write strategy            Write-through; Write-through with write buffer; Copy-back

10.4 Cache design - an example

 

The choice of organization for a cache requires the consideration of several factors as discussed in Section 10.3 on page 272, including the size of the cache, the degree of associativity, the line and block sizes, the replacement algorithm, and the write strategy. Detailed architectural simulations are required to analyse the effects of these choices on the performance of the cache.

The ARM3 cache

The ARM3, designed in 1989, was the first ARM chip to incorporate an on-chip cache, and detailed studies were carried out into the effects of these parameters on performance and bus use. These studies used specially designed hardware to capture address traces while running several benchmark programs on an ARM2; software was then used to analyse these traces to model the behaviour of the various organizations. (Today special hardware is generally unnecessary since desktop machines have sufficient performance to simulate large enough programs without hardware support.)

 

The study started by setting an upper bound on the performance benefit that could be expected from a cache. A 'perfect' cache, which always contains the requested data, was modelled to set this bound. Any real cache is bound to miss some of the time, so it cannot perform any better than one which always hits.

Three forms of perfect cache were modelled using realistic assumptions about the cache and external memory speeds (which were 20 MHz and 8 MHz respectively): caches which hold either just instructions, mixed instructions and data, or just data. The results are shown in Table 10.2 on page 280, normalized to the performance of a system with no cache. They show that instructions are the most important values to hold in the cache, but including data values as well can give a further 25% performance increase.


Table 10.2 'Perfect' cache performance.

Cache form                    Performance
No cache                      1
Instruction-only cache        1.95
Instruction and data cache    2.5
Data-only cache               1.13

Although a decision was taken early on that the cache write strategy would be write-through (principally on the grounds of simplicity), it is still possible for the cache to detect a write miss and load a line of data from the write address. This 'allocate on write miss' strategy was investigated briefly, but proved to offer a negligible benefit in exchange for a significant increase in complexity, so it was rapidly abandoned. The problem was reduced to finding the best organization, consistent with chip area and power constraints, for a unified instruction and data cache with allocation on a read miss.

Various different cache organizations and sizes were investigated, with the results shown in Figure 10.6. The simplest cache organization is the direct-mapped cache, but even with a size of 16 Kbytes, the cache is significantly worse than the 'perfect' case. The next step up in complexity is the dual-set associative cache; now the performance of the 16 Kbyte cache is within a per cent or so of the perfect cache. But at the time of the design of the ARM3 (1989) a 16 Kbyte cache required a large chip area, and the 4 Kbyte cache does not perform so well. (The results depend strongly on the program used to generate the address traces, but these are typical.)

Figure 10.6 Unified cache performance as a function of size and organization.

Going to the other extreme, a fully associative cache performs significantly better at the smaller size, delivering the 'perfect' performance on the benchmark program used for the tests. Here the replacement algorithm is random; LRU (least recently used) gives very similar results.

The cache model was then modified to use a quad-word line, which is necessary to reduce the area cost of the tag store. This change had minimal effect on the performance.

The fully associative cache requires a large CAM (Content Addressable Memory) tag store which is likely to consume significant power, even with a quad-word line. The power can be reduced considerably by segmenting the CAM into smaller components, but this reduces the associativity. An analysis of the sensitivity of the system performance to the degree of associativity, using a 4 Kbyte cache, is shown in Figure 10.7. This shows the performance of the system for all associativities from fully (256-way) associative down to direct-mapped (1-way). Although the biggest performance increase is in going from direct-mapped to dual-set associative, there are noticeable improvements all the way up to 64-way associativity.

It would therefore appear that a 64-way associative CAM-RAM cache provides the same performance as the fully associative cache while allowing the 256 CAM entries to be split into four sections to save power. The external memory bandwidth requirement of each level of associativity is also shown in Figure 10.7 (relative to an uncached processor), and note how the highest performance corresponds to the lowest external bandwidth requirement. Since each external access costs a lot of energy compared with internal activity, the cache is simultaneously increasing performance and reducing system power requirements.

Figure 10.7 The effect of associativity on performance and bandwidth requirement.

The organization of the cache is therefore that shown in Figure 10.8. The bottom two bits of the virtual address select a byte within a 32-bit word, the next two bits select a word within a cache line and the next two bits select one of the four 64-entry CAM tag stores. The rest of the virtual address is presented to the selected tag store (the other tag stores are disabled to save power) to check whether the data is in the cache, and the result is either a miss, or a hit together with the address of the data in the cache data RAM.

Figure 10.8 ARM3 cache organization.
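As a concrete illustration of this address split, the short C fragment below decodes a 32-bit virtual address into the fields just described (byte, word, CAM section and tag). It is only a sketch of the decoding, not ARM3 hardware; the type, field and function names are invented for the example.

    #include <stdint.h>
    #include <stdio.h>

    /* Illustrative decomposition of a 32-bit virtual address for a cache with
     * 4-word (16-byte) lines and four 64-entry CAM tag sections, as described
     * in the text. Names are invented for this sketch. */
    typedef struct {
        uint32_t byte_in_word;   /* bits [1:0]  - byte within a 32-bit word   */
        uint32_t word_in_line;   /* bits [3:2]  - word within a 4-word line   */
        uint32_t cam_section;    /* bits [5:4]  - which of the 4 CAM sections */
        uint32_t tag;            /* bits [31:6] - compared against CAM tags   */
    } cache_addr_t;

    static cache_addr_t decode_address(uint32_t va)
    {
        cache_addr_t a;
        a.byte_in_word = va & 0x3;
        a.word_in_line = (va >> 2) & 0x3;
        a.cam_section  = (va >> 4) & 0x3;  /* only this section's CAM is enabled */
        a.tag          = va >> 6;          /* presented to the selected tag store */
        return a;
    }

    int main(void)
    {
        cache_addr_t a = decode_address(0x00008F74);
        printf("tag=0x%x section=%u word=%u byte=%u\n",
               (unsigned)a.tag, (unsigned)a.cam_section,
               (unsigned)a.word_in_line, (unsigned)a.byte_in_word);
        return 0;
    }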

ARM600 cache control FSM

To illustrate the sort of control logic required to manage a cache, the ARM600 cache control finite state machine is described below. The ARM600 cache borrows its design from the ARM3 described in Section 10.4 on page 279 and it also includes a translation system similar to the scheme described in Section 10.5 on page 283.

The ARM600 operates with two clocks. The fast clock defines the processor cycle time when it is operating from the cache or writing to the write buffer; the memory clock defines the speed when the processor is accessing external memory. The clock supplied to the core switches dynamically between these two clock sources, which may be asynchronous with respect to each other. There is no requirement for the memory clock to be a simple subdivision of the fast clock, though if it is the processor can be configured to avoid the synchronization overhead.

Normally the processor runs from the cache using the fast clock. When a cache miss occurs (or a reference is made to uncacheable memory), the processor synchronizes to the memory clock and either performs a single external access or a cache line-fill. Because switching between the clocks incurs an overhead for the synchronization (to reduce the risk of metastability to an acceptable level), the processor checks the next address before deciding whether or not to switch back to the fast clock.

The finite state machine that controls this activity is shown in Figure 10.9 on page 284. Following initialization, the processor enters the Check tag state running from the fast clock. Depending on whether or not the addressed data is found in the cache, the processor can proceed in one of the following ways:

• So long as the address is non-sequential, does not fault in the MMU and is either a read found in the cache or a buffered write, the state machine remains in the Check tag state and a data value is returned or written every clock cycle.

• When the next address is a sequential read in the same cache line or a sequential buffered write, the state machine moves to the Sequential fast state where the data may be accessed without checking the tag and without activating the MMU. This saves power, and exploits the seq signal from the processor core. Again a data value is read or written every clock cycle.

• If the address is not in the cache or is an unbuffered write, an external access is required. This begins in the Start external state. Reads from uncacheable memory and unbuffered writes are completed as single memory transactions in the External state. Cacheable reads perform a quad-word line fetch, after fetching the necessary translation information if this was not already in the MMU.

• Cycles where the processor does not use memory are executed in the Idle state.

• At several points in the translation process it may become clear that the access cannot be completed and the Abort state is entered. Uncacheable reads and unbuffered writes may also be aborted by external hardware.

Figure 10.9 ARM600 cache control state machine.
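The behaviour listed above can be captured as a next-state function. The C sketch below is a deliberately simplified model of such a controller, not the actual ARM600 control logic: the state names follow the text and Figure 10.9, but the input signals and the transition details are assumptions made for illustration.

    /* Much-simplified model of a cache-control state machine of the kind shown
     * in Figure 10.9. State names follow the text; the inputs and transition
     * details are assumptions, not the actual ARM600 implementation. */
    typedef enum {
        CHECK_TAG,        /* fast clock; tag checked on every access          */
        SEQUENTIAL_FAST,  /* fast clock; same line, tag and MMU left idle     */
        START_EXTERNAL,   /* synchronize to the memory clock                  */
        EXTERNAL,         /* single uncached read or unbuffered write         */
        LINE_FETCH,       /* quad-word line fill on a cacheable read miss     */
        IDLE,             /* processor not using memory this cycle            */
        ABORT             /* translation fault or external abort              */
    } cache_state_t;

    typedef struct {
        int mem_request;  /* processor requests a memory access this cycle    */
        int sequential;   /* 'seq' signal: address follows the previous one   */
        int same_line;    /* sequential address stays within the current line */
        int hit;          /* requested data is resident in the cache          */
        int write;        /* access is a write                                */
        int bufferable;   /* write may be placed in the write buffer          */
        int cacheable;    /* read data may be placed in the cache             */
        int fault;        /* MMU cannot complete the translation              */
    } cache_inputs_t;

    static cache_state_t next_state(cache_state_t s, cache_inputs_t in)
    {
        if (in.fault)
            return ABORT;
        if (!in.mem_request)
            return IDLE;

        switch (s) {
        case CHECK_TAG:
        case SEQUENTIAL_FAST:
        case IDLE:
            /* A sequential read within the current line, or a sequential
             * buffered write, needs neither the tags nor the MMU. */
            if (in.sequential &&
                ((!in.write && in.same_line && in.hit) ||
                 (in.write && in.bufferable)))
                return SEQUENTIAL_FAST;
            /* Non-sequential read hits and buffered writes stay on the fast clock. */
            if ((!in.write && in.hit) || (in.write && in.bufferable))
                return CHECK_TAG;
            return START_EXTERNAL;      /* cache miss or unbuffered write */
        case START_EXTERNAL:
            return (!in.write && in.cacheable) ? LINE_FETCH : EXTERNAL;
        case EXTERNAL:
        case LINE_FETCH:
        case ABORT:
            return CHECK_TAG;           /* resume from the fast clock */
        }
        return CHECK_TAG;
    }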

10.5 Memory management

Modern computer systems typically have many programs active at the same time. A single processor can, of course, only execute instructions from one program at any instant, but by switching rapidly between the active programs they all appear to be executing at once, at least when viewed on a human timescale.


 

The rapid switching is managed by the operating system, so the application programmer can write his or her program as though it owns the whole machine. The mechanism used to support this illusion is described by the term memory management unit (MMU). There are two principal approaches to memory management, called segmentation and paging.

Segments

The simplest form of memory management allows an application to view its memory as a set of segments, where each segment contains a particular sort of information. For instance, a program may have a code segment containing all its instructions, a data segment and a stack segment. Every memory access provides a segment selector and a logical address to the MMU. Each segment has a base address and a limit associated with it. The logical address is an offset from the segment base address, and must be no greater than the limit or an access violation will occur, usually causing an exception. Segments may also have other access controls, for instance the code segment may be read-only and an attempt to write to it will also cause an exception.

The access mechanism for a segmented MMU is illustrated in Figure 10.10.

 

Figure 10.10 Segmented memory management scheme.
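A minimal sketch of the check in Figure 10.10 is given below, assuming a simple descriptor holding a base, a limit and one example permission bit; the structure layout, error-reporting convention and function name are invented for the example.

    #include <stdint.h>
    #include <stdbool.h>

    /* Sketch of the address check a segmented MMU performs (see Figure 10.10).
     * The descriptor layout and fault reporting are assumptions. */
    typedef struct {
        uint32_t base;     /* physical base address of the segment    */
        uint32_t limit;    /* maximum legal offset within the segment */
        bool     writable; /* example of an additional access control */
    } segment_t;

    /* Translate (selector, logical offset) to a physical address. Returns
     * false on an access violation, which a real MMU would report by raising
     * an exception. */
    static bool segment_translate(const segment_t *table, unsigned selector,
                                  uint32_t offset, bool is_write, uint32_t *phys)
    {
        const segment_t *seg = &table[selector];

        if (offset > seg->limit)           /* beyond the segment limit      */
            return false;
        if (is_write && !seg->writable)    /* e.g. write to a code segment  */
            return false;

        *phys = seg->base + offset;        /* offset is relative to the base */
        return true;
    }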

 

Segmentation allows a program to have its own private view of memory and to coexist transparently with other programs in the same memory space. It runs into difficulty, however, when the coexisting programs vary and the available memory is limited. Since the segments are of variable size, the free memory becomes fragmented over time and a new program may be unable to start, not because there is insufficient free memory, but because the free memory is all in small pieces none of which is big enough to hold a segment of the size required by the new program.

The crisis can be alleviated by the operating system moving segments around in memory to coalesce the free memory into one large piece, but this is inefficient, and most processors now incorporate a memory mapping scheme based on fixed-size chunks of memory called pages. Some architectures include segmentation and paging, but many, including the ARM, just support paging without segmentation.

Paging

In a paging memory management scheme both the logical and the physical address spaces are divided into fixed-size components called pages. A page is usually a few kilobytes in size, but different architectures use different page sizes. The relationship between the logical and physical pages is stored in page tables, which are held in main memory.

A simple sum shows that storing the translation in a single table requires a very large table: if a page is 4 Kbytes, 20 bits of a 32-bit address must be translated, which requires 2²⁰ × 20 bits of data in the table, or a table of at least 2.5 Mbytes. This is an unreasonable overhead to impose on a small system.

Instead, most paging systems use two or more levels of page table. For example, the top ten bits of the address can be used to identify the appropriate second-level page table in the first-level page table directory, and the second ten bits of the address then identify the page table entry which contains the physical page number. This translation scheme is illustrated in Figure 10.11.

Figure 10.11 Paging memory management scheme.
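To make the two-level walk concrete, the sketch below follows the 10/10/12 address split described above. The entry format (physical address in the upper 20 bits, flags in the lower 12), the idea of modelling physical memory as a flat word array, and the names mem, directory_base and translate are assumptions made so the example is self-contained; the real ARM formats appear in Section 11.6.

    #include <stdint.h>

    /* Model of a two-level page table walk for the 10/10/12 address split in
     * the text: 10 bits of directory index, 10 bits of page-table index and a
     * 12-bit offset within a 4 Kbyte page. */
    #define PAGE_MASK  0xFFFu     /* low 12 bits: offset within a page */
    #define INDEX_MASK 0x3FFu     /* 10-bit directory / table index    */

    /* 'mem' models physical memory as an array of 32-bit words, indexed by
     * physical byte address / 4. 'directory_base' is the physical address of
     * the first-level page directory. */
    uint32_t translate(const uint32_t *mem, uint32_t directory_base, uint32_t virt)
    {
        uint32_t dir_index   = (virt >> 22) & INDEX_MASK;  /* top 10 bits  */
        uint32_t table_index = (virt >> 12) & INDEX_MASK;  /* next 10 bits */
        uint32_t offset      =  virt        & PAGE_MASK;   /* low 12 bits  */

        /* First memory access: directory entry gives the page table's address. */
        uint32_t dir_entry   = mem[(directory_base >> 2) + dir_index];
        uint32_t table_base  = dir_entry & ~PAGE_MASK;

        /* Second memory access: page table entry gives the physical page. */
        uint32_t table_entry = mem[(table_base >> 2) + table_index];
        uint32_t page_base   = table_entry & ~PAGE_MASK;

        return page_base | offset;
    }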

Note that with the particular numbers suggested here, if 32 bits are allocated to each directory and page table entry, the directory and each page table happen to occupy exactly 4 Kbytes each, or exactly one memory page. The minimum overhead for a small system is 4 Kbytes for the page directory plus 4 Kbytes for one page table; this is sufficient to manage up to 4 Mbytes of physical memory. A fully populated 4 Gbyte (32-bit) memory would require 4 Mbytes of page tables, but this overhead is probably acceptable with this much memory to work in.

The ARM MMU, described in Section 11.6 on page 302, uses a slightly different allocation of bits from the one described here (and also supports the single-level translation of larger blocks of memory), but the principle is the same.

Virtual memory

One possibility with either memory management scheme is to allow a segment or page to be marked as absent and an exception to be generated whenever it is accessed. Then an operating system which has run out of memory to allocate can transparently move a page or a segment out of main memory into backup store, which for this purpose is usually a hard disk, and mark it as absent. The physical memory can then be allocated to a different use. If the program attempts to access an absent page or segment the exception is raised and the operating system can bring the page or segment back into main memory, then allow the program to retry the access.

 

When implemented with the paged memory management scheme, this process is known as demand-paged virtual memory. A program can be written to occupy a virtual memory space that is larger than the available physical memory space in the computer where it is run, since the operating system can wheel bits of program and data in as they are needed. Typical programs work some parts of their code very hard and rarely touch others; leaving the infrequently used routines out on disk will not noticeably affect performance. However, over-exploiting this facility causes the operating system to switch pages in and out of memory at a high rate. This is described as thrashing, and will adversely affect performance.

Restartable instructions

An important requirement in a virtual memory system is that any instruction that can cause a memory access fault must leave the processor in a state that allows the operating system to page-in the requested memory and resume the original program as though the fault had not happened. This is often achieved by making all instructions that access memory restartable. The processor must retain enough state to allow the operating system to recover enough of the register values so that, when the page is in main memory, the faulting instruction is retried with identical results to those that would have been obtained had the page been resident at the first attempt.

This requirement is usually the most difficult one to satisfy in the design of a processor while retaining high performance and minimum hardware redundancy.

Translation look-aside buffers

The paging scheme described above gives the programmer complete freedom and transparency in the use of memory, but it would seem that this has been achieved at considerable cost in performance since each memory access appears to have incurred an overhead of two additional memory accesses, one to the page directory and one to the page table, before the data itself is accessed.

This overhead is usually avoided by implementing a translation look-aside buffer (TLB), which is a cache of recently used page translations. As with instruction and data caches (described in Section 10.3 on page 272), there are organizational options relating to the degree of associativity and the replacement strategy. The line and block sizes usually equate to a single page table entry, and the size of a typical TLB is much smaller than a data cache at around 64 entries. The locality properties of typical programs enable a TLB of this size to achieve a miss rate of a per cent or so. The misses incur the table-walking overhead of two additional memory accesses.

The operation of a TLB is illustrated in Figure 10.12 on page 288.

Figure 10.12 The operation of a translation look-aside buffer.
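The essential TLB operation can be sketched in a few lines of C: a small fully associative store of recent translations is searched first, and only a miss pays for the two-access table walk. The 64-entry size follows the text; the entry layout, the round-robin replacement and the table_walk helper (assumed to be defined elsewhere) are assumptions for the example.

    #include <stdint.h>
    #include <stdbool.h>

    /* Minimal model of a translation look-aside buffer: a small, fully
     * associative cache of recent page translations. */
    #define TLB_ENTRIES 64
    #define PAGE_SHIFT  12

    typedef struct {
        uint32_t virt_page;   /* virtual page number (tag) */
        uint32_t phys_page;   /* physical page number      */
        bool     valid;
    } tlb_entry_t;

    static tlb_entry_t tlb[TLB_ENTRIES];
    static unsigned    next_victim;      /* round-robin replacement pointer */

    /* Assumed to be provided elsewhere: the two-memory-access table walk. */
    extern uint32_t table_walk(uint32_t virt_page);

    uint32_t tlb_translate(uint32_t virt)
    {
        uint32_t vpage  = virt >> PAGE_SHIFT;
        uint32_t offset = virt & ((1u << PAGE_SHIFT) - 1);

        /* Hit: hardware compares all tags in parallel; here we just scan. */
        for (unsigned i = 0; i < TLB_ENTRIES; i++)
            if (tlb[i].valid && tlb[i].virt_page == vpage)
                return (tlb[i].phys_page << PAGE_SHIFT) | offset;

        /* Miss: walk the page tables and cache the result. */
        uint32_t ppage = table_walk(vpage);
        tlb[next_victim] = (tlb_entry_t){ .virt_page = vpage,
                                          .phys_page = ppage,
                                          .valid = true };
        next_victim = (next_victim + 1) % TLB_ENTRIES;

        return (ppage << PAGE_SHIFT) | offset;
    }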

 

Virtual and physical caches

When a system incorporates both an MMU and a cache, the cache may operate either with virtual (pre-MMU) or physical (post-MMU) addresses.

A virtual cache has the advantage that the cache access may start immediately the processor produces an address, and, indeed, there is no need to activate the MMU if the data is found in the cache. The drawback is that the cache may contain synonyms, which are duplicate copies of the same main memory data item in the cache. Synonyms arise because address translation mechanisms generally allow overlapping translations. If the processor modifies the data item through one address route it is not possible for the cache to update the second copy, leading to inconsistency in the cache.

A physical cache avoids the synonym problem since physical memory addresses are associated with unique data items. However, the MMU must now be activated on every cache access, and with some MMU and cache organizations the address translation must be completed by the MMU before the cache access can begin, leading to much longer cache latencies.

A physical cache arrangement that neatly avoids the sequential access cost exploits the fact that a paging MMU only affects the high-order address bits, while the cache is accessed by the low-order address bits. Provided these sets do not overlap, the cache and MMU accesses can proceed in parallel. The physical address from the MMU arrives at the right time to be compared with the physical address tags from the cache, hiding the address translation time behind the cache tag access. This optimization is not applicable to fully associative caches, and only works if the page size used by the MMU is larger than each directly addressed portion of the cache. A 4 Kbyte page, for example, limits a direct-mapped cache to a maximum size of 4 Kbytes, a 2-way set-associative cache to a maximum size of 8 Kbytes, and so on.
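The size limit follows from a single piece of arithmetic: every bit used to index the cache must come from the untranslated page-offset bits, so each way of the cache can be at most one page in size. The tiny helper below, with an invented name, simply states that rule.

    #include <stdint.h>

    /* Largest cache that can be indexed entirely by untranslated page-offset
     * bits, allowing the cache and MMU to be accessed in parallel: each way
     * may be at most one page. Name and interface are invented for the sketch. */
    static uint32_t max_parallel_cache_size(uint32_t page_size, uint32_t ways)
    {
        return page_size * ways;   /* e.g. 4 Kbyte pages: 1-way -> 4 Kbytes,
                                      2-way -> 8 Kbytes, 4-way -> 16 Kbytes */
    }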

In practice both virtual and physical caches are in commercial use, the former relying on software conventions to contain the synonym problem and the latter either exploiting the above optimization or accepting the performance cost.