- V. Ya. Krakovsky, M. B. Fesenko
- In Computer Systems and Networks
- Contents
- Preface
- Introduction
- Module I. Basic Components of Digital Computers
- 1. The Structure of a Digital Computer
- 1.1. Introduction to Digital Computers
- Questions for Self-Testing
- 1.2. The Computer Work Stages Implementation Sequence
- Questions for Self-Testing
- 1.3. Register Gating and Timing of Data Transfers
- Questions for Self-Testing
- 1.4. Computer Interface Organization
- Questions for Self-Testing
- 1.5. Computer Control Organization
- Questions for Self-Testing
- 1.6. Function and Construction of Computer Memory
- Questions for Self-Testing
- 1.7. Architectural and Structural Memory Organization Features
- Questions for Self-Testing
- 2. Data Processing Fundamentals in Digital Computers
- 2.1. Element Base Development Influence on Data Processing
- Questions for Self-Testing
- 2.2. Computer Arithmetic
- Questions for Self-Testing
- 2.3. The Operand Multiplication Operation
- Questions for Self-Testing
- 2.4. Integer Division
- Questions for Self-Testing
- 2.5. Floating-Point Numbers and Operations
- Questions for Self-Testing
- Questions for Self-Testing on Module I
- Problems for Self-Testing on Module I
- Module II. Digital Computer Organization
- 3. Processors, Memory, and the Evolution of the System of Instructions
- 3.1. CISC and RISC Microprocessors
- Questions for Self-Testing
- 3.2. Pipelining
- Questions for Self-Testing
- 3.3. Interrupts
- Questions for Self-Testing
- 3.4. Superscalar Processing
- Questions for Self-Testing
- 3.5. Designing Instruction Formats
- Questions for Self-Testing
- 3.6. Building a Stack Frame
- Questions for Self-Testing
- 4. The Structures of Digital Computers
- 4.1. Microprocessors, Microcontrollers, and Systems
- Questions for Self-Testing
- 4.2. Stack Computers
- Questions for Self-Testing
- Questions for Self-Testing
- 4.4. Features of the Organizational Structure of the Pentium Processors
- Questions for Self-Testing
- 4.5. Computer Systems on a Chip
- Multicore Microprocessors.
- Questions for Self-Testing
- 4.6. Principles of Constructing Reconfigurable Computing Systems
- Questions for Self-Testing
- 4.7. Types of Digital Computers
- Questions for Self-Testing
- Questions for Self-Testing on Module II
- Problems for Self-Testing on Module II
- Module III. Parallelism and Scalability
- 5. Superscalar Processors
- 5.1. The SPARC Architecture
- Questions for Self-Testing
- 5.2. SPARC Addressing Modes and Instruction Set
- Questions for Self-Testing
- 5.3. Floating-Point on the SPARC
- Questions for Self-Testing
- 5.4. The SPARC Computer Family
- Questions for Self-Testing
- 6. Cluster Superscalar Processors
- 6.1. The Power Architecture
- Questions for Self-Testing
- 6.2. Multithreading
- Questions for Self-Testing
- 6.3. Power Microprocessors
- Questions for Self-Testing
- 6.4. Microarchitecture-Level Power-Performance Fundamentals
- Questions for Self-Testing
- 6.5. The Design Space of Register Renaming Techniques
- Questions for Self-Testing
- Questions for Self-Testing on Module III
- Problems for Self-Testing on Module III
- Module IV. Explicitly Parallel Instruction Computing
- 7. The Itanium Processors
- 7.1. Parallel Instruction Computing and Instruction-Level Parallelism
- Questions for Self-Testing
- 7.2. Predication
- Questions for Self-Testing
- Questions for Self-Testing
- 7.4. The Itanium Processor Microarchitecture
- Questions for Self-Testing
- 7.5. Deep Pipelining (10 Stages)
- Questions for Self-Testing
- 7.6. Efficient Instruction and Operand Delivery
- Instruction Bundles Capable of Full-Bandwidth Dispersal
- Questions for Self-Testing
- 7.7. High-ILP Execution Core
- Questions for Self-Testing
- 7.8. The Itanium Organization
- Implementation of Cache Hints
- Questions for Self-Testing
- 7.9. Instruction-Level Parallelism
- Questions for Self-Testing
- 7.10. Global Code Scheduler and Register Allocation
- Questions for Self-Testing
- 8. Digital Computers on the Basis of VLIW
- Questions for Self-Testing
- 8.2. Synthesis of Parallelism and Scalability
- Questions for Self-Testing
- 8.3. The MAJC Architecture
- Questions for Self-Testing
- 8.4. SCIT, a Ukrainian Supercomputer Project
- Questions for Self-Testing
- 8.5. Components of Cluster Supercomputer Architecture
- Questions for Self-Testing
- Questions for Self-Testing on Module IV
- Problems for Self-Testing on Module IV
- Conclusion
- List of Literature
- Index and Used Abbreviations
- 03680 Kyiv-680, 1 Kosmonavta Komarova Avenue.
Questions for Self-Testing
1. What is the general structure of a cluster supercomputer system?
2. What are the differences between the Xeon and Itanium platforms?
3. What topology is used for building the SCI system network?
4. With what peculiarities of computer architecture is the notion of machine intelligence connected?
5. What are the tendencies in the evolution of cluster architectures?
6. What units are used to measure latency?
7. What is the time of exchange latency determined by?
8. How is the throughput of the data transfer channel determined?
9. Does the value of latency depend on the length of a message?
10. What is the dependence between throughput and latency?
8.5. Components of Cluster Supercomputer Architecture
Each of the two clusters is an array of computing nodes connected by three networks. The first is a system network based on the SCI interface; the second is a file data network based on the Gigabit Ethernet interface; the third is a management network based on the Fast Ethernet interface. A general block diagram of the SCIT supercomputer is shown in Fig. 8.16.
The local data network is based on SCI and is used for high-performance, low-latency inter-node communication during a calculation. It is built as a 2D mesh: a 16-node cluster is configured as a 4x4 or 2x8 mesh, and a 32-node cluster as a 4x8 or 2x16 mesh. For data transfers based on MPI, the throughput is 250 MB/s for the Xeon E7501 platform and 355 MB/s for the Itanium2 8870 platform.
The local control network is based on Gigabit Ethernet and is used to manage all cluster computing nodes and to transfer data files between the nodes and a file server. A local monitor network is used to transfer service information and to monitor the whole cluster system.
The architecture of the multi-server cluster system is a multi-plane combination of hardware and software tools, particularly at the level of interaction of server operating systems, distribution of computational processes among processors and synchronization of these processes, and effective handling of queries to centralized or distributed file systems. The performance data of the SCIT-1 and SCIT-2 systems are given in Table 8.1.
Table 8.1. 64-bit performance parameters of the SCIT-1 and SCIT-2 systems

| Parameter | SCIT-1 | SCIT-2 |
| --- | --- | --- |
| 1. Processors | P-IV Xeon 2.67 GHz | Itanium2 1.4 GHz |
| 2. Peak performance of a single processor | | |
| Integer operations per second, 10^9 IPS | 1.34 | 5.6 |
| Floating-point operations per second, GFLOPS | 5.34 | 5.6 |
| Node system bus performance, GB/s | 4.2 | 6.4 |
| 3. Total peak performance of the system | | |
| Integer operations per second, 10^9 IPS | 43 | 358 |
| Floating-point operations per second, GFLOPS | 170 | 358 |
| Total system bus performance, GB/s | 67.2 | 204.8 |
| 4. Linpack performance of the system, GFLOPS | 112.5 | 280 |
The performance characteristics of the developed SCIT-1 and SCIT-2 systems are comparable with those of the world's best systems, and they rank among the world's leading systems built for mathematical supercomputing.
The creation of the SCIT-1 and SCIT-2 cluster systems, their integration, and their final launch were possible thanks to the fruitful cooperation of the Glushkov Institute of Cybernetics of the NAS of Ukraine, the Kyiv-based USTAR scientific and manufacturing company, and the Intel Corporation. The institute's partners provided the technical support and consulting for the project shown in Fig. 8.17.
The components of the system-level software of the cluster support all stages of user-level parallel software development and provide for the execution of processing-intensive problems. They run on all the nodes of the cluster as well as on the control node. The operating system used is Linux. The Message Passing Interface (MPI), rather than the raw SCI interface, is used for programming in the message-passing model. The system-level software also includes optimized C, C++, and Fortran compilers for parallel programming, fast math libraries, etc. The powerful hardware together with the system-level, service, and cluster-specific software integrated in the system provides a strong foundation for application-level software development.
Choosing Architecture Features for the Supercomputer Project. Based on the above-mentioned specific properties of cluster problems, it is possible to state generalized requirements for a cluster node for the effective solution of parallel problems:
– the performance of the node must depend linearly on the power of the processor, and the performance of the processor on the frequency of the main memory bus used and on the amount of main memory accessible in the node (to some reasonable degree);
– interprocessor data exchange must be faster than interconnect exchange, i.e., it is preferable to use multiprocessor nodes (with 2-4 processors) and multicore processors;
– the performance of a node must depend on the interconnect used; two features are important here: latency (i.e., the delay arising in the transmission of a minimum-size packet between nodes) and maximal throughput;
– the performance of a node must depend on the intensity of input-output operations with storage devices.
Pipelines and system calls. As a rule, cyclic algorithms are not used for the parallel problems executed by a computing node. Therefore the classic architecture with a short pipeline, used in AMD processors, is much preferable to the Intel P4 processor architecture. Every reference to the data of a neighboring process is accompanied by a few transitions into the kernel mode of the processor. The price of such a transition is a drop in processing speed by a factor of 120-240 on AMD processors and 1100-1300 on Intel P4 processors.
HyperThreading. Recall that HyperThreading is a technology for emulating two (or more) logical processors in the same processor core. As a rule, only one user application runs at a time; however, when at least two processes (i.e., fragments of machine code designed to solve a problem) are run, each additional process executes on a virtual processor realized by the emulating core. In contrast, in multicore technology the processors are not virtual: execution really proceeds on several processors. Because one of the pipelines idles on an incorrectly predicted branch, or simply because parallel execution of instructions is impossible in the P4 architecture, the idling resources can be used as a virtual processor (HyperThreading); in parallel problems, however, this only lowers performance. The explanation is simple: data exchange between nodes aligns the performance of all processes with the slowest one, and since a virtual processor provides no more than 40% of a real processor, overall performance falls 2-3 times; i.e., this capability is practically useless for clusters.
At present, the performance of the existing SCIT cluster systems is sufficient only for the simultaneous execution of a few problems; therefore, to meet the current needs of the institutes of the NAS of Ukraine, it is necessary to raise the performance of the supercomputer systems ten-fold or more (up to a few teraflops).
Features of a programming interface for cluster systems. Local networks used as MPP multiprocessor computer systems are called clusters of workstations, or simply clusters, as is any special interconnection of computers combined for solving a single task. Local networks assembled specifically to be used as a multiprocessor computer system and placed compactly in one or a few cabinets are called clusters of dedicated workstations. For cluster systems, the library of the Message Passing Interface (MPI) standard is the basic tool of concurrent programming [46, 50, 53].
A parallel program initially contains the code for all branches, but the loader starts the specified number of copies of the program. Each copy then determines its own sequence number and, depending on this number and the overall size of the computation field, executes one or another branch of the algorithm. As each branch has its own information space, fully isolated from the other branches, the branches exchange information only by passing messages in the concurrent programming environment. Note that an MPI application is started from the control workstation; therefore the executable file of the MPI application must be accessible on every computing node at the same absolute path as on the control workstation.
MPI, as a programming tool providing communication between the branches of a parallel application, gives a single mechanism for the cooperation of branches regardless of the machine architecture (uniprocessor or multiprocessor, with shared or separate memory) and of the user software interface. MPI procedures are subdivided into: general procedures providing the initiation and completion of processes, together with service functions; procedures for message sending and receiving; procedures for collective inter-process communication; procedures for process synchronization; and procedures for working with groups of processes. A process is the execution of a program on one processor on which MPI is installed, irrespective of whether that program contains internally parallel branches or I/O operations or is simply sequential program code.
The basic difference of the MPI standard from its predecessors is the concept of a communicator. The communicator determines the context of message passing: messages using different communicators do not influence each other and do not interact. All synchronization and message-passing operations are localized in a communicator. A group of processes is associated with the communicator. The group is a collection of processes, each of which has a unique identifier for cooperating with the other processes of the group through the group communicator. In particular, all collective operations are called simultaneously on all processes of the group. The group communicator realizes information exchanges between processes and their synchronization; in effect, the communicator serves as the communication environment for the given group, and each group of processes uses a separate communicator. Processes in a group usually have successive numbers from 0 to n-1, where n is the number of processes in the group. However, the MPI logic allows introducing other numberings for the branches of parallel programs: a branch can have a different number in the numbering proper to each communicator it belongs to. As communications between processes are encapsulated in the communicator, libraries of parallel programs may be created on the basis of MPI. The MPI library allows each function to have a second name; if a function has already been bound to one name, the second name is temporarily set aside. This approach is known as the weak symbols mechanism.
