- Matrix computations on systems equipped with GPUs
- Introduction
- The evolution of hardware for High Performance Computing
- The programmability issue on novel graphics architectures
- About this document. Motivation and structure
- Motivation and goals
- Structure of the document
- Description of the systems used in the experimental study
- Performance metrics
- Hardware description
- Software description
- The FLAME algorithmic notation
- The architecture of modern graphics processors
- The graphics pipeline
- Programmable pipeline stages
- The Nvidia G80 as an example of the CUDA architecture
- The architecture of modern graphics processors
- General architecture overview. Nvidia Tesla
- Memory subsystem
- The GPU as a part of a hybrid system
- Arithmetic precision. Accuracy and performance
- Present and future of GPU architectures
- Conclusions and implications on GPU computing
- BLAS on single-GPU architectures
- BLAS: Basic Linear Algebra Subprograms
- BLAS levels
- Naming conventions
- Storage schemes
- BLAS on Graphics Processors: NVIDIA CUBLAS
- Evaluation of the performance of NVIDIA CUBLAS
- Improvements in the performance of Level-3 NVIDIA CUBLAS
- gemm-based programming for the Level-3 BLAS
- Systematic development and evaluation of algorithmic variants
- Experimental results
- Impact of the block size
- Performance results for rectangular matrices
- Performance results for double precision data
- Padding
- Conclusions
- LAPACK-level routines on single-GPU architectures
- LAPACK: Linear Algebra PACKage
- LAPACK and BLAS
- Naming conventions
- Storage schemes and arguments
- LAPACK routines and organization
- Cholesky factorization
- Scalar algorithm for the Cholesky factorization
- Blocked algorithm for the Cholesky factorization
- Computing the Cholesky factorization on the GPU
- Basic implementations. Unblocked and blocked versions
- Padding
- Hybrid implementation
- LU factorization
- Scalar algorithm for the LU factorization
- Blocked algorithm for the LU factorization
- LU factorization with partial pivoting
- Computing the LU factorization with partial pivoting on the GPU
- Basic implementations. Unblocked and blocked versions
- Padding and hybrid algorithm
- Reduction to tridiagonal form on the graphics processor
- The symmetric eigenvalue problem
- Reduction to tridiagonal form. The LAPACK approach
- Reduction to tridiagonal form. The SBR approach
- Experimental Results
- Conclusions
- Matrix computations on multi-GPU systems
- Linear algebra computation on multi-GPU systems
- Programming model and runtime. Performance considerations
- Programming model
- Transfer management and spatial assignation
- Experimental results
- Impact of the block size
- Number of data transfers
- Performance and scalability
- Impact of data distribution
- Conclusions
- Matrix computations on clusters of GPUs
- Parallel computing memory architectures
- Shared memory architectures
- Distributed memory and hybrid architectures
- Accelerated hybrid architectures
- Parallel programming models. Message-passing and MPI
- ScaLAPACK
- PLAPACK
- Elemental
- Description of the PLAPACK infrastructure
- Layered approach of PLAPACK
- Usage of the PLAPACK infrastructure. Practical cases
- Porting PLAPACK to clusters of GPUs
- Experimental results
- Conclusions
- Conclusions
- Conclusions and main contributions
- Contributions for systems with one GPU
- Contributions for clusters of GPUs
- Related publications
- Publications directly related with the thesis topics
- Publications indirectly related with the thesis topics
- Other publications
- Open research lines
- FLAME algorithms for the BLAS-3 routines
[Figure 6.17 plot: "Cholesky factorization on LONGHORN"; GFLOPS (0-5000) versus matrix size (0-100000); two curves: 32 Quadro FX5800 on 32 nodes, and 32 Quadro FX5800 on 16 nodes.]
Figure 6.17: Performance of the device-centric implementation of the Cholesky factorization on LONGHORN, using 32 Quadro FX5800 GPUs on 32 nodes (one GPU per node) or on 16 nodes (two GPUs per node).
The advantages of the device-centric approach are clear, especially for large matrices. On the other hand, in the host-centric approach the size of the problem that can be solved is restricted only by the amount of main memory in the system, which is usually larger than the device memory (see the tested sizes for the matrix-matrix multiplication in Figure 6.16). In principle, this limitation of the device-centric approach can be overcome transparently to the programmer by handling the device memory as a cache of the host memory, as proposed in Chapter 5.
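The following C/CUDA fragment is a minimal sketch of this caching idea, not the runtime proposed in Chapter 5: it assumes a fixed number of device-resident slots and a simple round-robin replacement policy, and all identifiers (cache_slot, cache_lookup, CACHE_SLOTS, will_write) are hypothetical.

```c
/* Minimal sketch (not the Chapter 5 runtime) of using the device memory as a
 * software cache of host-resident matrix blocks.  All names are hypothetical. */
#include <cuda_runtime.h>
#include <stddef.h>

#define CACHE_SLOTS 8                 /* blocks kept simultaneously on the GPU */

typedef struct {
    double *host_block;               /* host address that identifies the block */
    double *dev_block;                /* device copy of that block              */
    size_t  bytes;                    /* size of the cached block               */
    int     dirty;                    /* device copy newer than the host copy?  */
} cache_slot;

static cache_slot cache[CACHE_SLOTS];
static int next_victim = 0;           /* trivial round-robin replacement policy */

/* Return a device pointer for a host block; the block crosses the
 * PCIExpress bus only on a cache miss. */
double *cache_lookup(double *host_block, size_t bytes, int will_write)
{
    for (int i = 0; i < CACHE_SLOTS; i++)
        if (cache[i].host_block == host_block) {
            cache[i].dirty |= will_write;
            return cache[i].dev_block;            /* hit: no transfer needed   */
        }

    cache_slot *s = &cache[next_victim];
    next_victim = (next_victim + 1) % CACHE_SLOTS;

    if (s->dev_block != NULL) {                   /* evict the previous tenant */
        if (s->dirty)                             /* write back modified data  */
            cudaMemcpy(s->host_block, s->dev_block, s->bytes,
                       cudaMemcpyDeviceToHost);
        cudaFree(s->dev_block);
    }

    cudaMalloc((void **)&s->dev_block, bytes);    /* bring the new block in    */
    cudaMemcpy(s->dev_block, host_block, bytes, cudaMemcpyHostToDevice);
    s->host_block = host_block;
    s->bytes      = bytes;
    s->dirty      = will_write;
    return s->dev_block;
}
```

With a scheme of this kind, blocks reused by consecutive operations cross the PCIExpress bus only once, while the problem size is limited by the host memory rather than by the device memory.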
Figure 6.17 shows the performance of the accelerated version of the Cholesky factorization in PLAPACK executed on 32 GPUs of LONGHORN. The results illustrate the difference in performance between a configuration that uses one GPU per node (for a total of 32 nodes) and one that uses two GPUs per node (for a total of 16 nodes). In the latter case, the penalty introduced by sharing the PCIExpress bus within each node is below 8% for the largest matrices. Although this difference in performance is non-negligible, the multi-GPU configuration still delivers attractive performance, and thus the trade-off between acquisition cost and raw performance must also be taken into account.
6.7. Conclusions
In previous chapters we have demonstrated that multi-GPU systems are an appealing platform for implementing high-performance dense linear algebra routines. However, given the bottleneck introduced by the PCIExpress bus as the number of GPUs sharing it increases, clusters of GPUs with a reduced number of GPUs per node are the natural evolution towards high-performance, GPU-based large-scale systems.
Porting existing distributed-memory codes to hybrid GPU-CPU clusters may be a challenging task. We have presented an approach to mechanically port the routines of the dense linear algebra message-passing library PLAPACK to a hybrid cluster whose nodes are equipped with hardware accelerators. By initially placing all data in the memory of the accelerators, the number of PCIExpress transfers between the memories of host and device is reduced and performance is improved. All data transfers are embedded inside the PLAPACK communication (copy) and consolidation (reduce) routines, so that retargeting the library routines is mostly automatic and transparent to the user.
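As an illustration of this scheme, the following sketch is not actual PLAPACK code: gpu_copy_block and its arguments are hypothetical, and it only shows where the host-device transfers can be confined inside the body of a point-to-point copy routine so that callers keep working with device-resident data.

```c
/* Hypothetical sketch of hiding host<->device transfers inside a
 * communication (copy) routine.  gpu_copy_block() is not part of PLAPACK. */
#include <cuda_runtime.h>
#include <mpi.h>
#include <stdlib.h>

void gpu_copy_block(const double *dev_src,  /* n x n block in the sender's GPU   */
                    double       *dev_dst,  /* n x n block in the receiver's GPU */
                    int n, int src_rank, int dst_rank, MPI_Comm comm)
{
    int rank;
    MPI_Comm_rank(comm, &rank);
    size_t bytes = (size_t)n * n * sizeof(double);

    if (rank == src_rank) {
        /* Stage the block in a host buffer before the MPI transfer. */
        double *host_buf = (double *)malloc(bytes);
        cudaMemcpy(host_buf, dev_src, bytes, cudaMemcpyDeviceToHost);
        MPI_Send(host_buf, n * n, MPI_DOUBLE, dst_rank, 0, comm);
        free(host_buf);
    } else if (rank == dst_rank) {
        /* Receive into a host buffer and push the block back to the GPU. */
        double *host_buf = (double *)malloc(bytes);
        MPI_Recv(host_buf, n * n, MPI_DOUBLE, src_rank, 0, comm,
                 MPI_STATUS_IGNORE);
        cudaMemcpy(dev_dst, host_buf, bytes, cudaMemcpyHostToDevice);
        free(host_buf);
    }
}
```

With this organization, only the data that is actually communicated between nodes crosses the PCIExpress bus, and both the library layers above the copy routine and the user code continue to operate exclusively on device-resident objects.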
The experimental results demonstrate that the integration of GPUs in the nodes of a cluster is an efficient, inexpensive and scalable solution for the acceleration of large dense linear algebra problems. Furthermore, PLAPACK has also demonstrated its portability to novel architectures. From the user's perspective, once the library has been adapted to clusters of GPUs, the development of GPU-accelerated codes becomes a transparent task.