- •V. Ya. Krakovsky, M. B. Fesenko
- •In Computer Systems and Networks
Questions for Self-Testing
1. What determines the pipeline cycle time of the Itanium processor?
2. What are the distinguishing features of the 10-stage core pipeline?
3. What does the Itanium processor core die contain?
4. What is the function of the return stack buffer (RSB)?
5. Draw the block diagram of the Itanium processor.
6. How is the three-level cache memory used in the Itanium processor?
7. Characterize the operation of the 10-stage pipeline in the Itanium processor.
8. What is the BRP used for?
9. What is the initialization of the control look-ahead fetching based on?
10. What mechanism is used by the compiler for fetching BRP instructions?
11. How is the front end of the Itanium processor organized?
7.6. Efficient Instruction and Operand Delivery
After instructions are fetched in the front-end, they move into the middle pipeline that disperses instructions, implements the architectural renaming of registers, and delivers operands to the wide parallel hardware. The hardware resources in the back-end of the machine are organized around nine issue ports. The instruction and operand delivery hardware maps the six incoming instructions onto the nine issue ports and remaps the virtual register identifiers specified in the source code onto physical registers used to access the register file. It then provides the source data to the execution core. The dispersal and renaming hardware exploits high-level semantic information provided by the IA-64 software, efficiently enabling greater ILP and reduced instruction path length.
Explicit parallelism directives. The instruction dispersal mechanism disperses instructions presented by the decoupling buffer to the processor's issue ports. The processor has a total of nine issue ports capable of issuing up to two memory instructions (ports M0 and M1), two integer instructions (ports I0 and I1), two floating-point instructions (ports F0 and F1), and three branch instructions (ports B0, B1, and B2) per clock. The processor's 17 execution units are fed through the M, I, F, and B groups of issue ports.
The decoupling buffer feeds the dispersal logic in a bundle-granular fashion (up to two bundles, or six instructions, per cycle), with a fresh bundle being presented each time one is consumed. Dispersal from the two bundles is instruction-granular: the processor disperses as many instructions as can be issued (up to six) in left-to-right order. The dispersal algorithm is fast and simple, with instructions being dispersed to the first available issue port, subject to two constraints: detection of instruction independence and detection of resource oversubscription.
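The left-to-right dispersal described above can be sketched as follows. This is a simplified illustration, not the actual hardware algorithm: the representation of an instruction as a `(type, stop_after)` pair and the function name `disperse` are assumptions made for the example, and the real processor applies further port rules that are not modeled here.

```python
# Issue ports as named in the text: 2 memory, 2 integer, 2 floating-point, 3 branch.
PORTS = {
    "M": ["M0", "M1"],
    "I": ["I0", "I1"],
    "F": ["F0", "F1"],
    "B": ["B0", "B1", "B2"],
}


def disperse(instructions):
    """Assign instructions to issue ports in left-to-right order.

    `instructions` is a list of (type, stop_after) pairs (hypothetical
    representation): type is "M", "I", "F" or "B", and stop_after is True
    when a stop bit follows the instruction. At most six instructions
    (two bundles) are considered per cycle.
    Returns the ports used, one per issued instruction.
    """
    free = {kind: list(ports) for kind, ports in PORTS.items()}
    issued = []
    for kind, stop_after in instructions[:6]:
        if not free[kind]:
            # Resource oversubscription: no port of this type is left,
            # so dispersal stops for this cycle.
            break
        issued.append(free[kind].pop(0))  # first available port of that type
        if stop_after:
            # Stop bit: later instructions may depend on earlier ones,
            # so they must wait for the next cycle.
            break
    return issued


if __name__ == "__main__":
    two_bundles = [("M", False), ("F", False), ("I", False),
                   ("M", False), ("I", False), ("B", False)]
    print(disperse(two_bundles))
```

In this model a mix of two memory, two integer, one floating-point, and one branch instruction issues all six instructions in one cycle, whereas three consecutive memory instructions stop after the second because only two M ports exist.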
Independence. The processor must ensure that all instructions issued in parallel are either independent or contain only allowed dependencies (such as a compare instruction feeding a dependent conditional branch). This requirement is easily met by using the stop-bits feature of the IA-64 ISA to explicitly communicate parallel instruction semantics. Instructions between consecutive stop bits are deemed independent, so the instruction independence detection hardware is trivial. This contrasts with traditional RISC processors, which must perform O(n²) comparisons (typically dozens) between source and destination register specifiers to determine independence.
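The difference in cost can be made concrete with a small sketch. It assumes, purely for illustration, that every instruction has two source registers and one destination register and that only source-against-earlier-destination comparisons are counted; the function names are hypothetical. The second function shows how stop bits let the hardware form groups of independent instructions without comparing any registers.

```python
def pairwise_comparisons(n, sources_per_instruction=2):
    """Register comparisons needed to check n instructions for independence
    when no stop bits are available (assumed model: each source of a later
    instruction is compared with the destination of each earlier one)."""
    pairs = n * (n - 1) // 2          # grows as O(n^2)
    return pairs * sources_per_instruction


def instruction_groups(instructions):
    """Split a sequence of (name, stop_after) pairs at the stop bits.

    Instructions inside one group are declared independent by the compiler,
    so no register comparison is required."""
    groups, current = [], []
    for name, stop_after in instructions:
        current.append(name)
        if stop_after:
            groups.append(current)
            current = []
    if current:
        groups.append(current)
    return groups


if __name__ == "__main__":
    print(pairwise_comparisons(6))
    print(instruction_groups([("ld", False), ("add", True),
                              ("cmp", False), ("br", True)]))
```

Under these assumptions six instructions require 30 comparisons, which matches the "dozens" mentioned in the text, while the stop-bit approach needs only a linear scan.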
Oversubscription. The processor must also guarantee that there are sufficient execution resources to process all the instructions that will be issued in parallel. This oversubscription check is simplified by the IA-64 ISA feature of instruction bundle templates. Each instruction bundle not only specifies three instructions but also contains a 4-bit template field, indicating the type of each instruction: memory (M), integer (I), branch (B), and so on. By examining template fields from the two bundles (a total of only 8 bits), the dispersal logic can quickly determine the number of memory, integer, floating-point, and branch instructions incoming every clock.
This is a hardware simplification resulting from the IA-64 instruction set architecture. Unlike in conventional instruction set architectures, the instruction encoding itself does not need to be examined to determine the type of each operation. This feature removes decoders that would otherwise be required to examine many bits of the encoded instruction to determine the instruction's type and associated issue port. A second key advantage of the template-based dispersal strategy is that certain instruction types can occur only at specific locations within any bundle. As a result, the dispersal interconnection network can be significantly optimized; the routing required from dispersal to issue ports is only about half of that required for a fully connected crossbar.
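The template-based resource check can be sketched as follows. For readability the sketch represents each template by a three-letter mnemonic such as "MFI" rather than by its binary encoding; that representation, the restriction to the M, I, F, and B types, and the function names are assumptions of the example, and the model ignores any additional issue rules of the real processor.

```python
from collections import Counter

# Number of issue ports per instruction type, as given in the text.
PORT_COUNTS = {"M": 2, "I": 2, "F": 2, "B": 3}


def type_counts(template_a, template_b):
    """Count instruction types in two bundles using only their templates.

    Templates are written as three-letter mnemonics (hypothetical
    representation), one letter per instruction slot."""
    return Counter(template_a + template_b)


def oversubscribed(template_a, template_b):
    """Return the set of instruction types for which the two bundles
    request more issue ports than the processor provides."""
    counts = type_counts(template_a, template_b)
    return {kind for kind, n in counts.items() if n > PORT_COUNTS.get(kind, 0)}


if __name__ == "__main__":
    print(type_counts("MFI", "MIB"))
    print(oversubscribed("MFI", "MIB"))
    print(oversubscribed("MII", "MII"))
```

Note that the check never looks at the instruction encodings themselves: the counts come entirely from the two template fields, which is the simplification the text describes.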
Table 7.1 illustrates the effectiveness of the dispersal strategy by enumerating the instruction bundles that may be issued at full bandwidth. As can be seen, a rich mix of instructions can be issued to the machine at high throughput (six per clock). The combination of stop bits and bundle templates, as specified in the IA-64 instruction set, allows the compiler to indicate the independence and instruction-type information directly and effectively to the dispersal hardware. As a result, the hardware is greatly simplified, thereby allowing an efficient implementation of instruction dispersal to a wide execution core.
Table 7.1. Instruction bundles capable of full-bandwidth dispersal
