- •V. Ya. Krakovsky, M. B. Fesenko
- •In Computer Systems and Networks
- •Contents
- •Preface
- •Introduction
- •Module I. Basic Components of Digital Computers
- •1. The Structure of a Digital Computer
- •1.1. Introduction to Digital Computers
- •Questions for Self-Testing
- •1.2. The Computer Work Stages Implementation Sequence
- •Questions for Self-Testing
- •1.3. Register Gating and Timing of Data Transfers
- •Questions for Self-Testing
- •1.4. Computer Interface Organization
- •Questions for Self-Testing
- •1.5. Computer Control Organization
- •Questions for Self-Testing
- •1.6. Function and Construction of Computer Memory
- •Questions for Self-Testing
- •1.7. Architecturally-Structural Memory Organization Features
- •Questions for Self-Testing
- •2. Data Processing Fundamentals in Digital Computers
- •2.1. Element Base Development Influence on Data Processing
- •Questions for Self-Testing
- •2.2. Computer Arithmetic
- •Questions for Self-Testing
- •2.3. Operands Multiplication Operation
- •Questions for Self-Testing
- •2.4. Integer Division
- •Questions for Self-Testing
- •2.5. Floating-Point Numbers and Operations
- •Questions for Self-Testing
- •Questions for Self-Testing on Module I
- •Problems for Self-Testing on Module I
- •Module II. Digital Computer Organization
- •3. Processors, Memory, and the Evolution of Instruction Systems
- •3.1. CISC and RISC Microprocessors
- •Questions for Self-Testing
- •3.2. Pipelining
- •Questions for Self-Testing
- •3.3. Interrupts
- •Questions for Self-Testing
- •3.4. Superscalar Processing
- •Questions for Self-Testing
- •3.5. Designing Instruction Formats
- •Questions for Self-Testing
- •3.6. Building a Stack Frame
- •Questions for Self-Testing
- •4. The Structures of Digital Computers
- •4.1. Microprocessors, Microcontrollers, and Systems
- •Questions for Self-Testing
- •4.2. Stack Computers
- •Questions for Self-Testing
- •Questions for Self-Testing
- •4.4. Structural Organization Features of the Pentium Processors
- •Questions for Self-Testing
- •4.5. Computer Systems on a Chip
- •Multicore Microprocessors.
- •Questions for Self-Testing
- •4.6. Principles of Constructing Reconfigurable Computing Systems
- •Questions for Self-Testing
- •4.7. Types of Digital Computers
- •Questions for Self-Testing
- •Questions for Self-Testing on Module II
- •Problems for Self-Testing on Module II
- •Module III. Parallelism and Scalability
- •5. Superscalar Processors
- •5.1. The SPARC Architecture
- •Questions for Self-Testing
- •5.2. SPARC Addressing Modes and Instruction Set
- •Questions for Self-Testing
- •5.3. Floating-Point on the SPARC
- •Questions for Self-Testing
- •5.4. The SPARC Computer Family
- •Questions for Self-Testing
- •6. Cluster Superscalar Processors
- •6.1. The Power Architecture
- •Questions for Self-Testing
- •6.2. Multithreading
- •Questions for Self-Testing
- •6.3. Power Microprocessors
- •Questions for Self-Testing
- •6.4. Microarchitecture Level Power-Performance Fundamentals
- •Questions for Self-Testing
- •6.5. The Design Space of Register Renaming Techniques
- •Questions for Self-Testing
- •Questions for Self-Testing on Module III
- •Problems for Self-Testing on Module III
- •Module IV. Explicitly Parallel Instruction Computing
- •7. The Itanium Processors
- •7.1. Parallel Instruction Computing and Instruction Level Parallelism
- •Questions for Self-Testing
- •7.2. Predication
- •Questions for Self-Testing
- •Questions for Self-Testing
- •7.4. The Itanium Processor Microarchitecture
- •Questions for Self-Testing
- •7.5. Deep Pipelining (10 stages)
- •Questions for Self-Testing
- •7.6. Efficient Instruction and Operand Delivery
- •Instruction bundles capable of full-bandwidth dispersal
- •Questions for Self-Testing
- •7.7. High ILP Execution Core
- •Questions for Self-Testing
- •7.8. The Itanium Organization
- •Implementation of cache hints
- •Questions for Self-Testing
- •7.9. Instruction-Level Parallelism
- •Questions for Self-Testing
- •7.10. Global Code Scheduler and Register Allocation
- •Questions for Self-Testing
- •8. Digital Computers on the Basis of VLIW
- •Questions for Self-Testing
- •8.2. Synthesis of Parallelism and Scalability
- •Questions for Self-Testing
- •8.3. The MAJC Architecture
- •Questions for Self-Testing
- •8.4. SCIT – Ukrainian Supercomputer Project
- •Questions for Self-Testing
- •8.5. Components of Cluster Supercomputer Architecture
- •Questions for Self-Testing
- •Questions for Self-Testing on Module IV
- •Problems for Self-Testing on Module IV
- •Conclusion
- •List of literature
- •Index and Used Abbreviations
- •03680, Kyiv-680, 1 Kosmonavta Komarova Avenue.
Questions for Self-Testing
1. How is the wavefront in the scheduling region determined?
2. What does "compensation" mean?
3. Explain the organization of the register system in the IA-64.
4. How is the deferred compensation code generation done?
5. Where is the global register allocation performed?
6. What are rotating registers used for?
7. What can be obtained by the speculation support in IA-64?
8. What is the essence of a cost-benefit analysis?
9. Explain the mechanism of global register allocation.
10. Where is global register allocation done?
11. What actions are used for wavefront scheduling?
12. Why is a bit vector used in IA-64 instead of a bit matrix with predicates?
13. What does the processor matrix mean?
14. What parameters are taken into account in a cost-benefit analysis?
15. What is the difference between Speculation and Prediction?
16. What results can be achieved by Speculation?
17. What analysis is used in GCS?
18. What parameters are taken into account in GCS?
19. How is a nested loop represented in GCS?
20. How is the general analysis of control flow within the scheduled region supported in GCS?
8. Digital Computers on the Basis of VLIW
8.1. The 16-Way Itanium Server
The chip set supports up to 16 Itanium processors for optimum 16-way performance. In combination with IA-64 Itanium processors, it provides a powerful platform solution for the backbone of the Internet and enterprise computing. The server can be partitioned into a maximum of four domains, each constituting an isolated and complete computer system. This feature aids the consolidation of many smaller servers into fewer, larger servers.
Fig. 8.1 shows the block diagram of the Itanium server 16-way configuration. The modular construction is composed of four 4-CPU cells interconnected via a data crossbar chip and address snoop network. The 16-way box can be hard-partitioned into a maximum of four domains by fully or partially disconnecting the crossbar and the address interconnect at natural boundaries [1], [10].
Each cell has one system bus that supports up to four Intel Itanium microprocessors with power pods, the Itanium server chip set's north bridge, main memory DIMMs, and four connections to peripheral component interconnect (PCI) adapters via proprietary Gigastream-Links (GSLs). Fig. 8.2 shows the interrelations among those components. Two of the four microprocessors and their associated power pods are located on each side of the cell.
Itanium server's distributed, shared-memory architecture provides each of the four cells with a portion of the main memory. Each cell has 32 DIMM sites, half of which are located on an optional memory daughterboard. The chip set supports up to 128 Gbytes of physical address space. As is the nature of a distributed, shared-memory machine, Itanium server's memory has a cache-coherent, nonuniform memory access (NUMA) model.
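The stated limits imply the per-DIMM capacity at full configuration. A quick arithmetic check, with constant names of our own invention taken from the figures in the text:

```python
# Illustrative capacity arithmetic for the distributed main memory;
# the constant names are ours, derived from the figures in the text.
CELLS = 4
DIMM_SITES_PER_CELL = 32       # half of them on the optional daughterboard
MAX_PHYSICAL_GB = 128          # chip-set physical address space limit

total_sites = CELLS * DIMM_SITES_PER_CELL     # 128 DIMM sites system-wide
gb_per_dimm = MAX_PHYSICAL_GB // total_sites  # 1 GB per DIMM at the limit
```

That is, reaching the 128-Gbyte ceiling requires populating all 128 sites with 1-Gbyte DIMMs.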
The four cells share the service processor and the base I/O, including the legacy south bridge, also shown in Fig. 8.1. When the server is partitioned into two or more domains, additional add-on base I/O PCI cards are inserted into designated PCI slots, except for the primary domain, which is serviced by the original base I/O attached to the service processor board. The shared service processor serves all domains simultaneously.
Each PCI adapter has two 64-bit PCI buses that are configurable as either two-slot 66-MHz buses or four-slot 33-MHz buses, as shown in Fig. 8.3. All of the PCI slots are hot pluggable. Each PCI adapter has two GSL ports; both ports may be used concurrently for performance, or alternatively for redundancy.

The maximum configuration of an Itanium server system is 128 PCI slots or 32 PCI buses, with a maximum of 16 PCI adapters. The resulting aggregate I/O bandwidth is approximately 8 Gbytes/s.
Chip set architecture. Fig. 8.2 shows the chip set components and their interconnections for each cell. The 16-way configuration has four sets of components plus the external data crossbar. The chip set design is optimized for 16-way or 4-cell configurations and employs a snoop-based coherency mechanism for lower snoop latencies. The chip set uses 0.25-micron process technology and operates at a multiple of the system bus clock frequency.
At the heart of the chip set is the system address controller, an ASIC that handles system bus, I/O, and intercell transactions; internal and external coherency control; address routing; and so on. Fig. 8.4 is a high-level block diagram of the system address controller. The system address controller controls the system data controller and transfers data to and from the system bus, main memory, I/O, and other cells. The I/O controller has signal connections to both the system address controller and system data controller. The I/O controller also has four GSLs to the I/Os as well as a Megastream Link to the legacy south bridge and service processor. I/O translation look-aside buffers are integrated in the I/O controller chip and convert a 32-bit address issued by a single-address-cycle PCI device into a full 64-bit address.
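The I/O TLB's job can be sketched as a page-granular table lookup. The 4-KB page size, the table contents, and the function names below are assumptions for illustration; the chip set's actual entry formats are not described in the text:

```python
# Hypothetical model of the I/O translation look-aside buffer: a
# single-address-cycle PCI device issues a 32-bit address, and the
# I/O controller expands it to a full 64-bit system address via a
# page-granular table lookup. 4-KB pages are an assumption here.
PAGE_SHIFT = 12
PAGE_MASK = (1 << PAGE_SHIFT) - 1

# Invented entry: 32-bit PCI page number -> 64-bit system page number.
io_tlb = {0x12345: 0x4000012}

def translate(pci_addr_32):
    page = pci_addr_32 >> PAGE_SHIFT
    offset = pci_addr_32 & PAGE_MASK
    sys_page = io_tlb[page]    # a real TLB would refill or fault on a miss
    return (sys_page << PAGE_SHIFT) | offset
```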
There are two memory chip sets, with one located on the cell and one on the optional memory daughterboard. Each consists of an intelligent memory address controller and four interleaving memory data controllers, and supports a chip-kill feature as well as a memory scan engine that performs memory initialization and test at power-on, and periodic memory patrol and scrubbing.
Cells are interconnected tightly and directly for addresses to form the address network, and via the data crossbar for data. In a two-cell configuration, the data crossbar chip component may be omitted by direct wiring between the two cells.
To effectively reduce the snoop traffic forwarded to the system bus, each cell has a snoop filter (tag SRAM) that keeps track of the cache contents in the four CPUs on the cell. When a coherent memory transaction is issued in one cell, its address is broadcast to all other cells for simultaneous snooping.
The snoop filter is checked for any chance that the inquired address is cached in the cell. If there is such a possibility, the address is forwarded to the system bus for snooping, and the result is returned to the requester cell. Otherwise, a snoop miss response is returned instantly as a result of the tag lookup. In either case, the snoop filter is updated by replacing or purging the tag entry associated with the CPU cache line that was loaded with the memory data. On a memory read, the local or remote addressed memory line is always read speculatively, whether or not the line may be cached in a CPU.
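The broadcast-and-filter protocol can be modeled in a few lines. The class, method names, and return values below are invented for illustration and greatly simplify the real coherency hardware (tag replacement and the speculative memory read are not modeled):

```python
# Toy model of the per-cell snoop filter. Each cell's filter is a set of
# line addresses that *might* be cached by its four CPUs.
class Cell:
    def __init__(self):
        self.filter_tags = set()

    def cpu_load(self, addr):
        """A CPU on this cell loads a line; the filter records the tag."""
        self.filter_tags.add(addr)

    def snoop(self, addr):
        """Broadcast snoop: forward to the system bus only on a filter hit;
        otherwise answer with an instant miss from the tag lookup."""
        if addr in self.filter_tags:
            return "forwarded-to-system-bus"
        return "snoop-miss"

cells = [Cell() for _ in range(4)]
cells[2].cpu_load(0x1000)      # a CPU in cell 2 caches line 0x1000

# A coherent transaction issued in cell 0 is broadcast to the other cells.
results = [cells[i].snoop(0x1000) for i in range(1, 4)]
```

Only cell 2 forwards the snoop to its system bus; cells 1 and 3 answer instantly from their filters, which is what keeps snoop traffic off uninvolved buses.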
The system address controller has numerous address range registers to configure, which present a single flat memory space to the operating system. Similarly, all the PCI buses can be configured – either by the firmware at boot time or dynamically during online reconfiguration – to make a single, system-wide PCI bus tree. For compatibility reasons, these configuration registers are mapped to the chip set configuration space in a manner similar to Intel's 82460GX chip set. This makes Itanium server a natural 16-way extension of an 82460GX-based 4-way system.
From an operating system's viewpoint, our 16-way platform appears as a collection of 16 CPUs on a single virtual system bus, which is also connected to a single large memory and a large PCI bus tree rooted at the system bus. Although there are certain practical compromises such as limiting external task priority register (XTPR)-based interrupt rerouting within each physical system bus, Itanium server's simple system image and near-uniform memory access latency make it easy to achieve very good scalability without elaborate programming.
The chip set architecture supports compatibility with the Itanium processor and features aimed at reliability, availability, and serviceability. These features include cell hot-plug capability and memory mirroring, data paths protected by error-correcting codes, error containment and graceful error propagation for IA-64 machine check abort recovery, parity-protected control logic, hardware consistency checking, detailed log registers, and backdoor access to the chip set.
Partitioning and In-Box Clustering. When Itanium server is hard-partitioned at cell boundaries creating isolated domains (four maximum), each domain constitutes a complete computer system and runs a separate operating system instance (Fig. 8.5). Each domain can contain an arbitrary number of cells, and each cell may have any number of CPUs.
The integrated service processor configures the domains by programming the chip set registers. Repartitioning may take place dynamically at user request, upon failure, or at boot time. Needless to say, each domain can be separately booted or brought down without affecting the operation of other domains. Altering the domain configuration while an operating system is running, however, requires operating system support. In addition to domain partitioning, the Itanium server chip set supports a true in-box clustering capability; that is, the partitioned domains can communicate with one another through the crossbar, eliminating the need for external interconnects. The internal interconnect is actually made up of partially shared physical memory, custom drivers, and optional user-mode libraries.
Each participating domain contributes a fraction of its physical memory at boot time to share among the nodes as a communication area. Fig. 8.6 shows the conceptual view of the in-box cluster memory in a 4×4 configuration. Users can program the amount of shared memory, as well as node configurations, in the field.
Custom drivers and user-mode libraries provide standard cluster application program interfaces to the upper layers and use the partially shared memory as the physical medium. This offers main memory latency and bandwidth for cluster node communications while ensuring application compatibility.
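Conceptually, a driver-level mailbox over the partially shared memory might look like the following sketch. The region layout and the length-prefixed protocol are invented for illustration and are not the actual driver interface:

```python
import struct

# A 4-KB stand-in for the shared communication area both domains map.
SHARED_REGION = bytearray(4096)

def send(region, payload):
    """Write a length header followed by the payload, as a driver might."""
    struct.pack_into(">I", region, 0, len(payload))
    region[4:4 + len(payload)] = payload

def receive(region):
    """Read the length header, then extract the payload."""
    (length,) = struct.unpack_from(">I", region, 0)
    return bytes(region[4:4 + length])

send(SHARED_REGION, b"hello from domain 0")
msg = receive(SHARED_REGION)
```

Because the "link" is ordinary memory, node-to-node communication runs at main-memory latency and bandwidth, while the driver and library layers preserve standard cluster APIs for applications.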
