- V. Ya. Krakovsky, M. B. Fesenko
- In Computer Systems and Networks
- Contents
- Preface
- Introduction
- Module I. Basic Components of Digital Computers
- 1. The Structure of a Digital Computer
- 1.1. Introduction to Digital Computers
- Questions for Self-Testing
- 1.2. The Computer Work Stages Implementation Sequence
- Questions for Self-Testing
- 1.3. Register Gating and Timing of Data Transfers
- Questions for Self-Testing
- 1.4. Computer Interface Organization
- Questions for Self-Testing
- 1.5. Computer Control Organization
- Questions for Self-Testing
- 1.6. Function and Construction of Computer Memory
- Questions for Self-Testing
- 1.7. Architectural and Structural Memory Organization Features
- Questions for Self-Testing
- 2. Data Processing Fundamentals in Digital Computers
- 2.1. Element Base Development Influence on Data Processing
- Questions for Self-Testing
- 2.2. Computer Arithmetic
- Questions for Self-Testing
- 2.3. The Operand Multiplication Operation
- Questions for Self-Testing
- 2.4. Integer Division
- Questions for Self-Testing
- 2.5. Floating-Point Numbers and Operations
- Questions for Self-Testing
- Questions for Self-Testing on Module I
- Problems for Self-Testing on Module I
- Module II. Digital Computer Organization
- 3. Processors, Memory, and the Evolution of the System of Instructions
- 3.1. CISC and RISC Microprocessors
- Questions for Self-Testing
- 3.2. Pipelining
- Questions for Self-Testing
- 3.3. Interrupts
- Questions for Self-Testing
- 3.4. Superscalar Processing
- Questions for Self-Testing
- 3.5. Designing Instruction Formats
- Questions for Self-Testing
- 3.6. Building a Stack Frame
- Questions for Self-Testing
- 4. The Structures of Digital Computers
- 4.1. Microprocessors, Microcontrollers, and Systems
- Questions for Self-Testing
- 4.2. Stack Computers
- Questions for Self-Testing
- Questions for Self-Testing
- 4.4. Features of the Organizational Structure of the Pentium Processors
- Questions for Self-Testing
- 4.5. Computer Systems on a Chip
- Multicore Microprocessors.
- Questions for Self-Testing
- 4.6. Principles of Constructing Reconfigurable Computing Systems
- Questions for Self-Testing
- 4.7. Types of Digital Computers
- Questions for Self-Testing
- Questions for Self-Testing on Module II
- Problems for Self-Testing on Module II
- Module III. Parallelism and Scalability
- 5. Superscalar Processors
- 5.1. The SPARC Architecture
- Questions for Self-Testing
- 5.2. SPARC Addressing Modes and Instruction Set
- Questions for Self-Testing
- 5.3. Floating-Point on the SPARC
- Questions for Self-Testing
- 5.4. The SPARC Computer Family
- Questions for Self-Testing
- 6. Cluster Superscalar Processors
- 6.1. The Power Architecture
- Questions for Self-Testing
- 6.2. Multithreading
- Questions for Self-Testing
- 6.3. Power Microprocessors
- Questions for Self-Testing
- 6.4. Microarchitecture-Level Power-Performance Fundamentals
- Questions for Self-Testing
- 6.5. The Design Space of Register Renaming Techniques
- Questions for Self-Testing
- Questions for Self-Testing on Module III
- Problems for Self-Testing on Module III
- Module IV. Explicitly Parallel Instruction Computing
- 7. The Itanium Processors
- 7.1. Parallel Instruction Computing and Instruction-Level Parallelism
- Questions for Self-Testing
- 7.2. Predication
- Questions for Self-Testing
- Questions for Self-Testing
- 7.4. The Itanium Processor Microarchitecture
- Questions for Self-Testing
- 7.5. Deep Pipelining (10 Stages)
- Questions for Self-Testing
- 7.6. Efficient Instruction and Operand Delivery
- Instruction Bundles Capable of Full-Bandwidth Dispersal
- Questions for Self-Testing
- 7.7. High-ILP Execution Core
- Questions for Self-Testing
- 7.8. The Itanium Organization
- Implementation of Cache Hints
- Questions for Self-Testing
- 7.9. Instruction-Level Parallelism
- Questions for Self-Testing
- 7.10. Global Code Scheduler and Register Allocation
- Questions for Self-Testing
- 8. Digital Computers on the Basis of VLIW
- Questions for Self-Testing
- 8.2. Synthesis of Parallelism and Scalability
- Questions for Self-Testing
- 8.3. The MAJC Architecture
- Questions for Self-Testing
- 8.4. SCIT, a Ukrainian Supercomputer Project
- Questions for Self-Testing
- 8.5. Components of Cluster Supercomputer Architecture
- Questions for Self-Testing
- Questions for Self-Testing on Module IV
- Problems for Self-Testing on Module IV
- Conclusion
- List of Literature
- Index and Used Abbreviations
- 03680 Kyiv-680, 1 Kosmonavta Komarova Avenue.
Questions for Self-Testing
1. What is the general structure of a cluster supercomputer system?
2. What are the differences between the Xeon and Itanium platforms?
3. What topology is used for building the SCI system network?
4. With what peculiarities of computer architecture is the notion of machine intelligence connected?
5. What are the tendencies in the evolution of cluster architectures?
6. What units are used to measure latency?
7. What is the time of exchange latency determined by?
8. How is the throughput of the data transfer channel determined?
9. Does the value of latency depend on the length of a message?
10. What is the dependence between throughput and latency?
8.5. Components of Cluster Supercomputer Architecture
Each of the two clusters is an array of computing nodes connected by three networks. The first is a system network based on the SCI interface; the second is a file data network based on the Gigabit Ethernet interface; the third is a management network based on the Fast Ethernet interface. A general block diagram of the SCIT supercomputer is shown in Fig. 8.16.
The local data network is based on SCI and is used for high-performance, low-latency inter-node communication during a calculation. It is built as a 2D mesh: a 16-node cluster is configured as a 4x4 or 2x8 mesh, and a 32-node cluster as a 4x8 or 2x16 mesh. For data transfers based on MPI, the throughput is 250 MB/s for the Xeon E7501 platform and 355 MB/s for the Itanium2 8870 platform.
The local control network is based on Gigabit Ethernet and is used to manage all cluster computing nodes and to transfer data files between the nodes and a file server. A local monitor network is used to transfer service information and to monitor the whole cluster system.
The architecture of the multi-server cluster system is a multi-plane combination of hardware and software tools, particularly at the level of interaction of server operating systems, distribution of computational processes among processors and synchronization of these processes, and effective handling of queries to centralized or distributed file systems. The performance data of the SCIT-1 and SCIT-2 systems are given in Table 8.1.
Table 8.1. 64-bit performance parameters of the SCIT-1 and SCIT-2 systems

| Parameter | SCIT-1 | SCIT-2 |
| --- | --- | --- |
| 1. Processors | P-IV Xeon 2.67 GHz | Itanium2 1.4 GHz |
| 2. Peak performance of a single processor | | |
| Integer operations per second, 10^9 IPS | 1.34 | 5.6 |
| Floating-point operations per second, GFLOPS | 5.34 | 5.6 |
| Node system bus performance, GB/s | 4.2 | 6.4 |
| 3. Total peak performance of the system | | |
| Integer operations per second, 10^9 IPS | 43 | 358 |
| Floating-point operations per second, GFLOPS | 170 | 358 |
| Total system bus performance, GB/s | 67.2 | 204.8 |
| 4. Linpack performance of the system, GFLOPS | 112.5 | 280 |
The performance characteristics of the developed SCIT-1 and SCIT-2 systems are comparable with those of the world's best systems, and they rank among the world's leading systems built for mathematical supercomputing.
The creation of the SCIT-1 and SCIT-2 cluster systems, their integration, and their final launch were possible thanks to the fruitful cooperation of the Glushkov Institute of Cybernetics of the NAS of Ukraine, the Kyiv-based USTAR scientific and manufacturing company, and the Intel Corporation. The institute's partners provided the technical support and consulting for the project shown in Fig. 8.17.
The components of the system-level software of the cluster support all stages of user-level parallel software development and provide for the execution of processing-intensive problems. They run on all the nodes of the cluster as well as on the control node. The operating system used is Linux. The Message Passing Interface (MPI), rather than the raw SCI interface, is used for programming in the message-passing model. The system-level software also includes optimized C, C++, and Fortran compilers for parallel programming, fast math libraries, etc. The powerful hardware together with the system-level, service, and cluster-specific software integrated in the system provides a strong foundation for application-level software development.
Choosing Architecture Features for the Supercomputer Project. Based on the above-mentioned specific properties of cluster problems, it is possible to state generalized requirements for a cluster node for the effective solution of parallel problems:
– the performance of the node must depend linearly on the power of the processor, and the performance of the processor on the frequency of the main memory bus used and on the amount of main memory accessible in the node (to some reasonable degree);
– interprocessor data exchange must be faster than interconnect exchange, i.e., it is preferable to use multiprocessor nodes (with 2-4 processors) and multicore processors;
– the performance of a node must depend on the interconnect used; two features are important here: latency (i.e., the delay arising in the transmission of a minimum-size packet between nodes) and maximal throughput;
– the performance of a node must depend on the intensity of input-output operations with storage devices.
Pipelines and system calls. As a rule, cyclic algorithms are not used for the parallel problems executed by a computing node. Therefore the classic architecture with a short pipeline, used in AMD processors, is much preferable to the Intel P4 processor architecture. Every reference to the data of a neighboring process is accompanied by a few transitions into the kernel mode of the processor. The price of such a transition is a drop in processing speed by a factor of 120-240 on AMD processors and 1100-1300 on Intel P4 processors.
HyperThreading. Recall that HyperThreading is a technology for emulating two (or more) logical processors in the same processor core. As a rule, only one user application runs at a time; however, when at least two processes (i.e., fragments of machine code designed to solve a problem) are run, each additional process executes on a virtual processor realized by the emulating core. In contrast, in multicore technology the processors are not virtual: execution really proceeds on several processors. Because one of the pipelines idles on an incorrectly predicted branch, or simply because parallel execution of instructions is impossible in the P4 architecture, the idling resources can be used as a virtual processor (HyperThreading); in parallel problems, however, this only lowers performance. The explanation is simple: data exchange between nodes aligns the performance of all processes with the slowest one, and since a virtual processor provides no more than 40% of a real processor, overall performance falls 2-3 times; i.e., this capability is practically useless for clusters.
At present, the performance of the existing SCIT cluster systems is sufficient only for the simultaneous execution of a few problems; therefore, to meet the current needs of the institutes of the NAS of Ukraine, it is necessary to raise the performance of the supercomputer systems ten-fold or more (up to a few teraflops).
Features of a programming interface for cluster systems. Local networks used as MPP multiprocessor computer systems are called clusters of workstations, or simply clusters, as is any special interconnection of computers combined for solving a single task. Local networks assembled specifically to be used as a multiprocessor computer system and placed compactly in one or a few cabinets are called clusters of dedicated workstations. For cluster systems, the library of the Message Passing Interface (MPI) standard is the basic tool of concurrent programming [46, 50, 53].
A parallel program initially contains the code for all branches, but the loader starts the specified number of copies of the program. Each copy then determines its own sequence number and, depending on this number and the overall size of the computation field, executes one or another branch of the algorithm. As each branch has its own information space, fully isolated from the other branches, the branches exchange information only by passing messages in the concurrent programming environment. Note that an MPI application is started from the control workstation; therefore the executable file of the MPI application must be accessible on every computing node at the same absolute path as on the control workstation.
MPI, as a programming tool providing communication between the branches of a parallel application, gives a single mechanism for the cooperation of branches regardless of the machine architecture (uniprocessor or multiprocessor, with shared or separate memory) and of the user software interface. MPI procedures are subdivided into: general procedures providing the initiation and completion of processes, together with service functions; procedures for message sending and receiving; procedures for collective inter-process communication; procedures for process synchronization; and procedures for working with groups of processes. A process is the execution of a program on one processor on which MPI is installed, irrespective of whether that program contains internally parallel branches or I/O operations or is simply sequential program code.
The basic difference of the MPI standard from its predecessors is the concept of a communicator. The communicator determines the context of message passing: messages using different communicators do not influence each other and do not interact. All synchronization and message-passing operations are localized in a communicator. A group of processes is associated with the communicator. The group is a collection of processes, each of which has a unique identifier for cooperating with the other processes of the group through the group communicator. In particular, all collective operations are called simultaneously on all processes of the group. The group communicator realizes information exchanges between processes and their synchronization; in effect, the communicator serves as the communication environment for the given group, and each group of processes uses a separate communicator. Processes in a group usually have successive numbers from 0 to n-1, where n is the number of processes in the group. However, the MPI logic allows introducing other numberings for the branches of parallel programs: a branch can have a different number in the numbering proper to each communicator it belongs to. As communications between processes are encapsulated in the communicator, libraries of parallel programs may be created on the basis of MPI. The MPI library allows each function to have a second name; if a function has already been bound to one name, the second name is temporarily set aside. This approach is known as the weak symbols mechanism.
