- V. Ya. Krakovsky, M. B. Fesenko
- In Computer Systems and Networks
- Contents
- Preface
- Introduction
- Module I. Basic Components of Digital Computers
- 1. The Structure of a Digital Computer
- 1.1. Introduction to Digital Computers
- Questions for Self-Testing
- 1.2. The Computer Work Stages Implementation Sequence
- Questions for Self-Testing
- 1.3. Register Gating and Timing of Data Transfers
- Questions for Self-Testing
- 1.4. Computer Interface Organization
- Questions for Self-Testing
- 1.5. Computer Control Organization
- Questions for Self-Testing
- 1.6. Function and Construction of Computer Memory
- Questions for Self-Testing
- 1.7. Architecturally-Structural Memory Organization Features
- Questions for Self-Testing
- 2. Data Processing Fundamentals in Digital Computers
- 2.1. Element Base Development Influence on Data Processing
- Questions for Self-Testing
- 2.2. Computer Arithmetic
- Questions for Self-Testing
- 2.3. The Operand Multiplication Operation
- Questions for Self-Testing
- 2.4. Integer Division
- Questions for Self-Testing
- 2.5. Floating-Point Numbers and Operations
- Questions for Self-Testing
- Questions for Self-Testing on Module I
- Problems for Self-Testing on Module I
- Module II. Digital Computer Organization
- 3. Processors, Memory, and the Evolution of Instruction Sets
- 3.1. CISC and RISC Microprocessors
- Questions for Self-Testing
- 3.2. Pipelining
- Questions for Self-Testing
- 3.3. Interrupts
- Questions for Self-Testing
- 3.4. Superscalar Processing
- Questions for Self-Testing
- 3.5. Designing Instruction Formats
- Questions for Self-Testing
- 3.6. Building a Stack Frame
- Questions for Self-Testing
- 4. The Structures of Digital Computers
- 4.1. Microprocessors, Microcontrollers, and Systems
- Questions for Self-Testing
- 4.2. Stack Computers
- Questions for Self-Testing
- Questions for Self-Testing
- 4.4. Features of the Organization Structure of the Pentium Processors
- Questions for Self-Testing
- 4.5. Computer Systems on a Chip
- Multicore Microprocessors
- Questions for Self-Testing
- 4.6. Principles of Constructing Reconfigurable Computing Systems
- Questions for Self-Testing
- 4.7. Types of Digital Computers
- Questions for Self-Testing
- Questions for Self-Testing on Module II
- Problems for Self-Testing on Module II
- Module III. Parallelism and Scalability
- 5. Superscalar Processors
- 5.1. The SPARC Architecture
- Questions for Self-Testing
- 5.2. SPARC Addressing Modes and Instruction Set
- Questions for Self-Testing
- 5.3. Floating-Point on the SPARC
- Questions for Self-Testing
- 5.4. The SPARC Computer Family
- Questions for Self-Testing
- 6. Cluster Superscalar Processors
- 6.1. The Power Architecture
- Questions for Self-Testing
- 6.2. Multithreading
- Questions for Self-Testing
- 6.3. Power Microprocessors
- Questions for Self-Testing
- 6.4. Microarchitecture-Level Power-Performance Fundamentals
- Questions for Self-Testing
- 6.5. The Design Space of Register Renaming Techniques
- Questions for Self-Testing
- Questions for Self-Testing on Module III
- Problems for Self-Testing on Module III
- Module IV. Explicitly Parallel Instruction Computing
- 7. The Itanium Processors
- 7.1. Parallel Instruction Computing and Instruction-Level Parallelism
- Questions for Self-Testing
- 7.2. Predication
- Questions for Self-Testing
- Questions for Self-Testing
- 7.4. The Itanium Processor Microarchitecture
- Questions for Self-Testing
- 7.5. Deep Pipelining (10 stages)
- Questions for Self-Testing
- 7.6. Efficient Instruction and Operand Delivery
- Instruction bundles capable of full-bandwidth dispersal
- Questions for Self-Testing
- 7.7. High-ILP Execution Core
- Questions for Self-Testing
- 7.8. The Itanium Organization
- Implementation of cache hints
- Questions for Self-Testing
- 7.9. Instruction-Level Parallelism
- Questions for Self-Testing
- 7.10. Global Code Scheduler and Register Allocation
- Questions for Self-Testing
- 8. Digital Computers on the Basis of VLIW
- Questions for Self-Testing
- 8.2. Synthesis of Parallelism and Scalability
- Questions for Self-Testing
- 8.3. The MAJC Architecture
- Questions for Self-Testing
- 8.4. SCIT – the Ukrainian Supercomputer Project
- Questions for Self-Testing
- 8.5. Components of Cluster Supercomputer Architecture
- Questions for Self-Testing
- Questions for Self-Testing on Module IV
- Problems for Self-Testing on Module IV
- Conclusion
- List of Literature
- Index and Used Abbreviations
- 03680, Kyiv-680, 1 Kosmonavta Komarova Avenue.
Questions for Self-Testing
1. What are the peculiarities of the 5-level organization?
2. What parameters characterize the quality of the main memory?
3. How is the problem of increasing main memory speed solved?
4. How is the problem of increasing main memory size related to virtual memory organization?
5. What is the main idea of dynamic memory?
6. What factors does the choice of a main memory chip depend on?
7. What peculiarities of static memory chip realization do you know?
8. How is dynamic memory realized on a chip?
9. What are the operations in the memory access cycle?
10. What is the main purpose of memory refresh?
1.7. Architecturally-Structural Memory Organization Features
There are the following varieties of architecturally-structural memory organization:
shared memory (Symmetric Multiprocessing, SMP) or distributed memory (Massively Parallel Processor, MPP);
common (shared) memory connected to several processors by a single bus (Uniform Memory Architecture, UMA);
memory that is physically distributed but logically shared (Non-Uniform Memory Architecture, NUMA);
Cache-Only Memory Architecture (COMA);
integration, on a single memory chip, of DRAM technology with the technology for building logic circuits or processors distributed over the whole memory array (Processor-In-Memory, PIM), etc.
Multiple-Module Memories and Interleaving. If main memory is structured as a collection of physically separate modules, each with its own Address Buffer Register (ABR) and Data Buffer Register (DBR), it is possible for more than one module to be performing Read or Write operations at any given time. The average rate of transmission of words to and from the total RAM system can thus be increased. Extra controls are required, but since RAM speed is often the bottleneck in computation, the expense involved is usually justified in large computers.
The way in which individual addresses are distributed over the modules is a critical factor in determining the average number of modules that can be kept busy as computations proceed. Two methods of address layout are indicated in Fig. 1.28. In the first case, the RAM address generated by the CPU is decoded as shown in Fig. 1.28,a. The high-order k bits of the address name one of n modules, and the low-order m bits name a particular word in that module. If the CPU issues Read requests to consecutive locations, as it does when fetching instructions of a straight-line program, then only one module is kept busy by the CPU. However, devices with direct memory access (DMA) ability may be operating in other memory modules.
The second and more effective way of addressing the modules is shown in Fig. 1.28,b. It is called memory interleaving. A module is selected by the low-order k bits of the RAM address, and the high-order m bits name a location within that module. Therefore, consecutive addresses are located in successive modules. Thus, any component of the system, for example the CPU or a DMA device, that generates requests for access to consecutive RAM locations can keep a number of modules busy at any one time. This results in a higher average utilization of the memory system as a whole.
In the system of Fig. 1.28,b, there must be 2^k modules; otherwise, there will be gaps of nonexistent locations in the RAM address space. This raises a practical issue. The first system described, Fig. 1.28,a, is more flexible than the second in that any number of modules up to 2^k can be used. The modules are normally assigned consecutive Multiple-Module (MM) addresses from 0 up. Hence, an existing system can be expanded by simply adding one or more modules as required. The second system must always have the full set of 2^k modules, and a failure in any module affects all areas of the address space. A failed module in the first system affects only a localized area of the address space. To take advantage of an interleaved RAM unit, the CPU should be able to issue requests for memory words before these words are actually needed for execution.
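The two address layouts of Fig. 1.28 can be sketched as a few lines of Python; the function names and the sizes k = 2, m = 4 are illustrative assumptions, not values from the text:

```python
# Sketch of the two module-addressing layouts, assuming k = 2 (4 modules)
# and m = 4 (16 words per module); names are illustrative.
K, M = 2, 4
N_MODULES = 1 << K          # 2^k modules
WORDS_PER_MODULE = 1 << M   # 2^m words per module

def high_order_select(addr):
    """Fig. 1.28,a: the high-order k bits pick the module."""
    return addr >> M, addr & (WORDS_PER_MODULE - 1)   # (module, word)

def interleaved_select(addr):
    """Fig. 1.28,b: the low-order k bits pick the module."""
    return addr & (N_MODULES - 1), addr >> K          # (module, word)

# Consecutive addresses 0..3: the high-order layout keeps only module 0
# busy, while interleaving spreads the requests over modules 0,1,2,3.
print([high_order_select(a)[0] for a in range(4)])   # [0, 0, 0, 0]
print([interleaved_select(a)[0] for a in range(4)])  # [0, 1, 2, 3]
```

This is why a straight-line instruction fetch stream keeps several modules busy under interleaving but only one under the high-order layout.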
Cache Memories. Analysis of a large number of typical programs has shown that most of their execution time is spent in a few main routines. When execution is localized within these routines, a number of instructions are executed repeatedly. This may be in the form of a simple loop, nested loops, or a few procedures that repeatedly call each other. The actual detailed pattern of instruction sequencing is not important. The main observation is that many instructions in each of a few localized areas of the program are repeatedly executed, while the remainder of the program is accessed relatively infrequently. This phenomenon is referred to as locality of reference.
Now, if it can be arranged to have the active segments of a program in a fast memory, then the total execution time can be significantly reduced. Such a memory is referred to as a cache (or buffer) memory. It is inserted between the CPU and the RAM, as shown in Fig. 1.29. To make this arrangement effective, the cache must be considerably faster than the RAM. Their relative access times usually differ by a factor of 5 to 10. This approach is more economical than the use of fast memory devices to implement the entire RAM.
Conceptually, the operation of a cache memory is very simple. The memory control circuitry is designed to take advantage of the property of locality of reference. When a Read request is received from the CPU, the contents of a block of memory words containing the specified location are transferred into the cache one word at a time. When any of the locations in this block is referenced by the program, its contents are read directly from the cache. Usually, the cache memory can store a number of such blocks at any given time. The correspondence between the RAM blocks and those in the cache is specified by means of a mapping function. When the cache is full and a memory word (instruction or data) is referenced that is not in the cache, a decision must be made as to which block should be removed to create space for the new block that contains the referenced word. The collection of rules for making this decision constitutes the replacement algorithm.
In each of the techniques that we will describe, there are some basic assumptions and operations that are independent of the particular mapping function and replacement algorithm used. It is best to describe them first. The CPU does not need to know explicitly about the existence of the cache. The CPU simply makes Read and Write requests as described previously. The addresses generated by the CPU always refer to locations in the RAM. The memory-access control circuitry shown in Fig. 1.29 determines whether or not the requested word currently exists in the cache. If it does, the Read or Write operation is performed on the appropriate cache location. When the operation is a Read, the main memory is not involved. However, if the operation is a Write, there are two ways that the system can proceed. In the first case, the cache location and the RAM location are updated simultaneously. This is called the store-through method. The alternative is to update the cache location only and to mark it as such through the use of an associated flag bit. Later, when the block containing this marked word is to be removed from the cache to make way for a new block, the permanent RAM location of the word is updated. The store-through method is clearly simpler, but it results in unnecessary Write operations in the RAM when a given cache word is updated a number of times during its cache residency period.
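The cost difference between the two Write policies can be sketched with a small count of RAM writes; the class and counter names below are illustrative assumptions, not from the text:

```python
# Minimal sketch contrasting store-through with deferred update via a
# flag ("dirty") bit; all names are illustrative.
class CacheWord:
    def __init__(self, value=0):
        self.value = value
        self.dirty = False      # the associated flag bit

ram_writes_store_through = 0
ram_writes_deferred = 0

word = CacheWord()
for v in range(5):              # the CPU updates the same cached word 5 times
    word.value = v
    ram_writes_store_through += 1   # store-through: RAM updated on every Write
    word.dirty = True               # deferred: only mark the word as modified

if word.dirty:                  # on eviction, write the marked word back once
    ram_writes_deferred += 1

print(ram_writes_store_through, ram_writes_deferred)   # 5 1
```

Five updates during one cache residency cost five RAM writes under store-through, but only one write-back under the flag-bit scheme.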
Next, consider the case where the addressed word is not in the cache and the operation is a Read. If this happens, the block of words in the RAM that contains the requested word is brought into the cache, and then the particular word requested is forwarded to the CPU. There is an opportunity for some time saving here if the word is forwarded to the CPU as soon as it is available from the RAM, instead of waiting for the whole block to be loaded into the cache. This is called load-through. During a Write operation, if the addressed word is not in the cache, the information is written directly into the RAM. In this case, there is little advantage in transferring the block containing the addressed word to the cache. A Write operation normally refers to a location in one of the data areas of a program rather than to the memory area containing the program instructions. The property of locality of reference is not as pronounced in accessing data when Write operations are involved. Finally, we should recall that in the case of an interleaved memory, contiguous block transfers are very efficient. Thus, transferring data in blocks from the RAM to the cache enables an interleaved RAM unit to operate at its maximum possible speed.
Mapping Functions. In order to discuss possible methods for specifying where RAM blocks are placed in the cache, it is helpful to use a specific example. Consider a cache of 2048 (2K) words with a block size of 16 words. This means that the cache is organized as 128 blocks. Let the RAM have 64K words, addressable by a 16-bit address. For mapping purposes, the memory will be considered as composed of 4K blocks of 16 words each.
The simplest way of associating RAM blocks with cache blocks is the direct-mapping technique. In this technique, block k of the RAM maps onto block k modulo 128 of the cache. This is depicted in Fig. 1.30. Since more than one RAM block is mapped onto a given cache block position, contention may arise for that position. This situation occurs even when the cache is not full. Contention is resolved by allowing the new block to overwrite the currently resident block. Thus the replacement algorithm is trivial. The detailed operation of the direct-mapping technique is as follows. Let a RAM address consist of three fields, as shown in Fig. 1.30. When a new block is first brought into the cache, the high-order 5 bits of its RAM address are stored in five tag bits associated with its location in the cache. When the CPU generates a memory request, the 7-bit block address determines the corresponding cache block. The tag field of that block is compared to the tag field of the address. If they match, the desired word (specified by the low-order 4 bits of the address) is in that block of the cache.
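The 5/7/4-bit field split of Fig. 1.30 can be sketched directly; the helper name is an illustrative assumption:

```python
# Sketch of the 16-bit address split for the direct-mapped example:
# 5 tag bits, 7 block bits, 4 word bits (helper name is illustrative).
def direct_map_fields(addr):
    word  = addr & 0xF           # low-order 4 bits: word within the block
    block = (addr >> 4) & 0x7F   # next 7 bits: cache block position
    tag   = addr >> 11           # high-order 5 bits: tag
    return tag, block, word

addr = 0xABCD                    # an arbitrary 16-bit RAM address
tag, block, word = direct_map_fields(addr)

# RAM block k (= addr >> 4) maps onto cache block k mod 128:
assert (addr >> 4) % 128 == block
print(tag, block, word)          # 21 60 13
```

A cache hit occurs when the 5 stored tag bits of cache block 60 equal 21; otherwise RAM block 2748 overwrites whatever block currently occupies that position.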
If there is no match, the required word must be accessed in the RAM. The direct-mapping technique is easy to implement, but it is not very flexible. Fig. 1.31 shows a much more flexible mapping method, whereby a RAM block can potentially reside in any cache block position. This is called the associative-mapping technique. In this case, 12 tag bits are required to identify a RAM block when it is resident in the cache. The tag bits of an address received from the CPU must be compared to the tag bits of each block of the cache to see if the desired block is present. Since there is complete freedom in block positioning, a wide range of replacement algorithms is possible. However, it might not be practical to make full use of this freedom, because complex replacement algorithms may be too difficult to implement. The cost of implementation is also adversely affected by the requirement for a 128-way associative search of 12-bit patterns.
The final mapping method to be discussed is the most practical. It is intermediate between the above two techniques. Blocks of the cache are grouped into sets, and the mapping allows a block of RAM to reside in any block of a specific set. Having a few choices for block placement eases the contention problem of the direct method. At the same time, decreasing the size of the associative search reduces the hardware cost.
An example of this block-set-associative-mapping technique is given in Fig. 1.32 for the case of two blocks per set. The 6-bit set field of the address determines which set of the cache might contain the desired block, as in the direct-mapping method. The tag field of the address must then be associatively compared to the tags of the two blocks of the set to see if a match occurs signifying block presence. This two-way associative search is not difficult to implement. It is clear that four blocks per set would be accommodated by a 5-bit set field, eight blocks per set by a 4-bit set field, etc. The extreme condition of 128 blocks per set requires no set bits and corresponds to the fully associative technique with 12 tag bits. The other extreme of one block per set is the direct-mapping method.
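The trade-off between set-field and tag-field widths described above can be reproduced with a small calculation; the helper name and default parameters are illustrative assumptions for the 128-block, 16-word-block example cache:

```python
# Sketch: tag/set field widths versus associativity for the example cache
# (128 blocks, 12-bit RAM block address); names are illustrative.
def field_widths(blocks_per_set, total_blocks=128, block_address_bits=12):
    n_sets = total_blocks // blocks_per_set
    set_bits = n_sets.bit_length() - 1        # log2 of the number of sets
    tag_bits = block_address_bits - set_bits  # the rest of the block address
    return tag_bits, set_bits

print(field_widths(1))     # (5, 7)   one block per set: direct mapping
print(field_widths(2))     # (6, 6)   two-way block-set-associative
print(field_widths(4))     # (7, 5)   four blocks per set: 5-bit set field
print(field_widths(8))     # (8, 4)   eight blocks per set: 4-bit set field
print(field_widths(128))   # (12, 0)  one set: fully associative, 12 tag bits
```

The two extremes of the table recover exactly the direct-mapping and fully associative cases discussed above.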
There has been a tacit assumption that both tags and data are in the cache memory. But it is quite reasonable to have the few tag bits in a separate, even faster memory, especially when associative searches are required. The result of accessing or searching this faster tag directory determines whether or not the desired block is in the cache. If it is, the block is in the cache position that directly corresponds to the directory tag position where the match is found.
Another technical detail that should be mentioned, which is in fact independent of the mapping function, is that it is usually necessary to have a Valid bit associated with each block. This bit indicates whether or not the block contains valid data. It does not serve the same function as the bit mentioned earlier that is needed to distinguish whether or not the cache contains an updated version of the RAM block. The Valid bits are all set to 0 when power is initially applied to the system or when the RAM is loaded with new programs and data from mass storage devices. These latter transfers normally bypass the cache and are achieved by an I/O channel or some simpler DMA mechanism. The Valid bit of a particular cache block is set to 1 the first time this block is loaded from the RAM. Once set, a Valid bit stays equal to 1 unless a RAM block is updated by a source other than the CPU that bypasses the cache. In this case, a check is made whether the block is currently in the cache. If it is, the Valid bit is set to 0. The introduction of the Valid bit also means that a slight modification should be made to our earlier discussion of cache accesses. As well as tag matching being a condition, accessing should only proceed if the Valid bit is equal to 1.
Flash Memory Devices. Flash memory is a form of non-volatile memory that can be electrically erased and reprogrammed; it is a kind of Electrically Erasable Programmable Read-Only Memory (EEPROM), in which recorded data is erased by applying electrical pulses. Besides, this device does not need a special apparatus for programming, as its counterpart the EPROM does. It is used for storing information while the power is off. A short description of the different memory types is given in Table 1.1.
Table 1.1. Performance of different memory types

| Memory type | Category | Record cancellation | Byte altering | Volatility | Application |
|-------------|-------------|------------------|---------------|------------|----------------------|
| SRAM | Read/Write | Electric | Yes | Yes | Level-2 cache memory |
| DRAM | Read/Write | Electric | Yes | Yes | Main memory |
| ROM | Read only | Impossible | No | No | Large-volume devices |
| PROM | Read only | Impossible | No | No | Small-volume devices |
| EPROM | Mainly read | Ultraviolet rays | No | No | Device simulation |
| EEPROM | Mainly read | Electric | Yes | No | Device simulation |
| Flash | Read/Write | Electric | No | No | Digital cameras |
The most contemporary type of EEPROM is flash memory. Unlike EPROM, which is erased by ultraviolet rays, and EEPROM, which can be altered byte by byte, flash memory is written and erased in blocks. Like any EEPROM, flash memory may be erased without extracting it from the integrated circuit. Many manufacturers produce small circuit boards with hundreds of megabytes of flash memory; they are used for storing images in digital cameras and for other purposes. It is possible that flash memory will eventually replace disks, since its access time is about 100 ns. Its basic technical problem is the limited number of erase cycles, about 10,000, whereas disks may be used for years regardless of the number of rewritings.
For example, the microJava computer usually uses flash memory, which is connected via PCI as shown in Fig. 1.33.
Virtual Memories. Virtual memory is a technique that uses main memory as a “cache” for secondary storage. In any computer system in which the currently active programs and data do not fit into the physical RAM space, secondary storage devices such as magnetic disks or tapes are used to hold the overflow. This space problem was first solved by requiring programmers to explicitly move programs or parts of programs from secondary storage to RAM when they were to be executed. However, managing the available RAM space is a machine-dependent problem that should not need to be solved by the programmer. The general techniques of automatically moving the required program and data blocks into the physical RAM for execution are called virtual-memory techniques. Programs, and hence the CPU, reference an instruction and data space that is independent of the physical RAM space. The binary addresses that the CPU issues for either instructions or data are called virtual or logical addresses. The mechanism that operates on these virtual addresses and translates them into actual locations in the physical hierarchy is usually implemented by a combination of hardware and software components. If the result of translating (or mapping) a specific virtual address is a physical RAM location, the contents of that location are used immediately as required. On the other hand, if the location is not in the RAM, its contents must be brought into a suitable location in the RAM and then used.
The simplest form of translation method is based on the assumption that all programs and data are composed of fixed-length pages. These pages are basic units of word blocks that must always occupy contiguous locations, whether they are resident in the RAM or in secondary storage. Pages are commonly 512 or 1024 words long. They constitute the basic unit of information that is moved back and forth between the RAM and secondary storage whenever the translation mechanism determines that a move is required. This discussion clearly parallels many of the ideas that were introduced in the cache memory section. The cache concept is intended to bridge the speed gap between the CPU and the RAM. Hence the cache control is implemented in the hardware. The virtual-memory idea is primarily meant to bridge the size gap between the RAM and secondary storage. It is usually implemented in part by software techniques. Conceptually, cache techniques and virtual-memory techniques involve very similar ideas. They differ mainly in the details of their implementation.
An address translation method based on the concept of fixed-length pages is shown schematically in Fig. 1.34. Each virtual address generated by the CPU, whether it is for an instruction fetch or an operand fetch/store operation, is interpreted as a page number (high-order bits) followed by a word number (low-order bits). A page table in the RAM specifies the location of the pages that are currently in the RAM. The starting address of the page table is kept in the page table base register. By adding the page number to the contents of this register, the address of the corresponding entry in the page table is obtained. The contents of this location name the block of the RAM where the requested page currently resides; or, if the page is not in the RAM, the page table entry points to where the page is in the secondary storage. The control bits indicate whether or not the page is present in the RAM. They may also contain some past-usage information, etc., for purposes of implementing the page replacement algorithm.
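The table walk just described can be modelled schematically; everything below (the 1024-word page size, the table contents, and all names) is an illustrative assumption, not the text's notation:

```python
# Schematic model of page-table translation (Fig. 1.34), assuming
# 1024-word pages, so the low-order 10 bits are the word number.
PAGE_BITS = 10

page_table = {                 # page number -> (present bit, location)
    0: (1, 3),                 # page 0 resident in RAM block 3
    1: (0, "disk:0x7F00"),     # page 1 currently in secondary storage
}

def translate(virtual_addr):
    page = virtual_addr >> PAGE_BITS               # high-order bits: page number
    word = virtual_addr & ((1 << PAGE_BITS) - 1)   # low-order bits: word number
    present, where = page_table[page]
    if not present:                                # control bit says "not in RAM"
        raise RuntimeError(f"page fault: fetch page {page} from {where}")
    return (where << PAGE_BITS) | word             # physical RAM address

print(hex(translate(0x0025)))   # page 0, word 0x25 -> 0xc25 in RAM block 3
```

A reference to any word of page 1 would raise the page fault, standing in for the long transfer from secondary storage during which the CPU may run another task.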
If the page table is stored in the RAM unit, as assumed above, then two RAM accesses need to be made for every RAM access requested by a program. This degradation in execution speed by a factor of 2 is the price that is paid for the programming convenience of a wider addressable memory space. It is not necessary that the page table be implemented in the RAM unit. The system can operate faster if the page table is stored in a small fast memory.
When a new page is to be brought from secondary storage to the RAM, the page table may provide the details of where this data can be found on a magnetic disk. On the other hand, it may provide an address pointer to a block of words in the RAM where this detailed information is stored. In either case, a long delay is now incurred while the page transfer into the RAM takes place. An I/O channel or DMA operation usually does this. At this point, the CPU may be used to execute another task whose pages are in the RAM.
The general problem of deciding which page is to be removed from a full RAM, when a new page is to be brought in, is just as critical here as in the cache situation. The notion that programs tend to spend most of their time in a few localized areas is also equally applicable. Since RAMs are considerably larger than cache memories, it should be possible to keep relatively larger portions of a program in the RAM. This will reduce the frequency of transfers to and from secondary storage. Concepts like the Least Recently Used (LRU) replacement algorithm can be applied to page replacement. The data that determines the LRU page or set of pages can be kept in the page table control bits. The need to update the LRU information on every RAM reference is another reason why the page table should be implemented in a high-speed memory.
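The LRU policy mentioned above can be sketched in a few lines using Python's OrderedDict to track recency; the capacity of 3 pages and all names are illustrative assumptions:

```python
# Minimal LRU page-replacement sketch for a RAM holding at most 3 pages.
from collections import OrderedDict

CAPACITY = 3
resident = OrderedDict()       # page number -> page data; order = recency

def reference(page):
    """Touch a page; evict the least recently used one if RAM is full."""
    if page in resident:
        resident.move_to_end(page)                # now the most recently used
    else:
        if len(resident) >= CAPACITY:
            victim, _ = resident.popitem(last=False)  # LRU page leaves RAM
            print("evict page", victim)
        resident[page] = f"data of page {page}"   # bring the new page in

for p in [1, 2, 3, 1, 4]:      # re-touching page 1 saves it; page 2 is evicted
    reference(p)
print(list(resident))          # [3, 1, 4]
```

In a real system the recency information would live in the page table control bits and be updated by hardware on every RAM reference, which is one more reason to keep the page table in fast memory.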
All computer components develop very unevenly. While CPU speed grows very quickly (about 60% per year), memory speed increases by only about 7% per year, which means that the transistors of the CPU are idle for the greater part of the operating time. This disparity between the growth of CPU speed and memory speed degrades communication between the CPU and its environment [53].