- Matrix computations on systems equipped with GPUs
- Introduction
- The evolution of hardware for High Performance Computing
- The programmability issue on novel graphics architectures
- About this document. Motivation and structure
- Motivation and goals
- Structure of the document
- Description of the systems used in the experimental study
- Performance metrics
- Hardware description
- Software description
- The FLAME algorithmic notation
- The architecture of modern graphics processors
- The graphics pipeline
- Programmable pipeline stages
- The Nvidia G80 as an example of the CUDA architecture
- The architecture of modern graphics processors
- General architecture overview. Nvidia Tesla
- Memory subsystem
- The GPU as a part of a hybrid system
- Arithmetic precision. Accuracy and performance
- Present and future of GPU architectures
- Conclusions and implications on GPU computing
- BLAS on single-GPU architectures
- BLAS: Basic Linear Algebra Subprograms
- BLAS levels
- Naming conventions
- Storage schemes
- BLAS on Graphics Processors: NVIDIA CUBLAS
- Evaluation of the performance of NVIDIA CUBLAS
- Improvements in the performance of Level-3 NVIDIA CUBLAS
- gemm-based programming for the Level-3 BLAS
- Systematic development and evaluation of algorithmic variants
- Experimental results
- Impact of the block size
- Performance results for rectangular matrices
- Performance results for double precision data
- Padding
- Conclusions
- LAPACK-level routines on single-GPU architectures
- LAPACK: Linear Algebra PACKage
- LAPACK and BLAS
- Naming conventions
- Storage schemes and arguments
- LAPACK routines and organization
- Cholesky factorization
- Scalar algorithm for the Cholesky factorization
- Blocked algorithm for the Cholesky factorization
- Computing the Cholesky factorization on the GPU
- Basic implementations. Unblocked and blocked versions
- Padding
- Hybrid implementation
- LU factorization
- Scalar algorithm for the LU factorization
- Blocked algorithm for the LU factorization
- LU factorization with partial pivoting
- Computing the LU factorization with partial pivoting on the GPU
- Basic implementations. Unblocked and blocked versions
- Padding and hybrid algorithm
- Reduction to tridiagonal form on the graphics processor
- The symmetric eigenvalue problem
- Reduction to tridiagonal form. The LAPACK approach
- Reduction to tridiagonal form. The SBR approach
- Experimental Results
- Conclusions
- Matrix computations on multi-GPU systems
- Linear algebra computation on multi-GPU systems
- Programming model and runtime. Performance considerations
- Programming model
- Transfer management and spatial assignation
- Experimental results
- Impact of the block size
- Number of data transfers
- Performance and scalability
- Impact of data distribution
- Conclusions
- Matrix computations on clusters of GPUs
- Parallel computing memory architectures
- Shared memory architectures
- Distributed memory and hybrid architectures
- Accelerated hybrid architectures
- Parallel programming models. Message-passing and MPI
- ScaLAPACK
- PLAPACK
- Elemental
- Description of the PLAPACK infrastructure
- Layered approach of PLAPACK
- Usage of the PLAPACK infrastructure. Practical cases
- Porting PLAPACK to clusters of GPUs
- Experimental results
- Conclusions
- Conclusions
- Conclusions and main contributions
- Contributions for systems with one GPU
- Contributions for clusters of GPUs
- Related publications
- Publications directly related with the thesis topics
- Publications indirectly related with the thesis topics
- Other publications
- Open research lines
- FLAME algorithms for the BLAS-3 routines
CHAPTER 2. THE ARCHITECTURE OF MODERN GRAPHICS PROCESSORS
[Figure: the graphics pipeline stages (input assembler, vertex shader, geometry shader, setup & rasterizer, fragment shader, raster operations) mapped onto a single unified processor array]

Figure 2.4: Cyclic approach of the graphics pipeline in the unified architectures.
The key contribution of this evolved architecture was the introduction of programmable stages, which became the kernel of current graphics architectures, now equipped with a large number of fully-programmable processing units. The appearance of these programmable units in the graphics pipeline implementation is what transformed GPUs into a feasible target for general-purpose computation.
In addition to the hardware update, the introduction of new APIs for programming the GPU brought a renewed interest in GPGPU. Among those APIs, the most successful ones were Cg [64] and HLSL [124], jointly developed by NVIDIA and Microsoft.
2.2. The Nvidia G80 as an example of the CUDA architecture
The mapping of this logical programmable pipeline onto the physical processor is what ultimately transformed the GPU computing scenario. In 2006, GPU vendors introduced a novel architectural design based on the idea of unified vertex and pixel processors. In this approach, there is no distinction between the units that perform vertex processing and those that perform pixel processing. From this generation of GPUs onwards, all programmable stages are executed by the same functional units, regardless of the nature of the calculation to be performed.
From the graphics perspective, the aim of this transformation was to reduce the load imbalance that frequently occurred between vertex and pixel processing. Due to this imbalance, many of the functional units inside the GPU were essentially idle for significant periods of time. In the unified architecture, there is only one type of processing unit, capable of executing both vertex and pixel operations. Thus, the sequential pipeline is transformed into a cyclic one, in which data recirculates through the processor: data produced by one stage is used as input to subsequent stages, running on the same computational resources but with a reconfigured behavior. Figure 2.4 illustrates this novel view of the graphics pipeline.
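The cyclic pipeline described above can be modeled conceptually: a single pool of identical processing units runs whichever stage program is currently scheduled, and each stage's output recirculates as the next stage's input. The following Python sketch is purely illustrative; the stage functions and data are invented for the example and do not correspond to any real GPU API.

```python
# Conceptual model of a unified shader architecture: one pool of
# identical units executes every pipeline stage, and data recirculates
# through that same pool stage after stage.

def vertex_stage(data):
    # Placeholder for per-vertex work (e.g., a geometric transform).
    return [(x * 2.0, y * 2.0) for (x, y) in data]

def geometry_stage(data):
    # Placeholder for per-primitive work (geometry shaders may emit
    # new primitives, modeled here by appending one extra point).
    return data + [(0.0, 0.0)]

def fragment_stage(data):
    # Placeholder for per-fragment work (e.g., shading).
    return [(x + 1.0, y + 1.0) for (x, y) in data]

def unified_processor_array(stages, data):
    """The same functional units run every stage: the pipeline is a
    loop in which each stage's output feeds the next stage's input."""
    for stage in stages:
        # Reconfigure the behavior of the units, reuse the hardware.
        data = stage(data)
    return data

vertices = [(1.0, 2.0), (3.0, 4.0)]
result = unified_processor_array(
    [vertex_stage, geometry_stage, fragment_stage], vertices)
print(result)  # → [(3.0, 5.0), (7.0, 9.0), (1.0, 1.0)]
```

The point of the sketch is the single `unified_processor_array` loop: in the pre-unified designs each stage had its own dedicated hardware, whereas here one set of resources is time-shared across all stages, which is precisely what removes the vertex/pixel load imbalance.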