
CHAPTER 6. MATRIX COMPUTATIONS ON CLUSTERS OF GPUS

[Figure (plot): "Cholesky factorization on LONGHORN". y-axis: GFLOPS (0 to 5000); x-axis: Matrix size (0 to 100000); series: "32 Quadro FX5800 on 32 nodes" and "32 Quadro FX5800 on 16 nodes".]

Figure 6.17: Performance of the device-centric implementation of the Cholesky factorization on LONGHORN, using 32 GPUs on 32 nodes (1 GPU per node) or on 16 nodes (2 GPUs per node).

The benefits of the device-centric approach are clear, especially for large matrices. On the other hand, in the host-centric approach, the size of the problem that can be solved is restricted by the amount of main memory in the system, which is usually larger than the device memory (see the tested sizes for the matrix-matrix multiplication in Figure 6.16). In principle, this limitation of the device-centric approach can be removed transparently to the programmer by handling the device memory as a cache of the host memory, as proposed in Chapter 5.
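To make that caching idea concrete, the following is a minimal sketch assuming a fixed block size and LRU replacement; the names (dev_cache_get, CACHE_SLOTS, BLOCK_BYTES) are hypothetical and do not belong to PLAPACK or to the implementation described here.

/* Sketch: device memory as a software cache of host memory. */
#include <stdio.h>
#include <cuda_runtime.h>

#define CACHE_SLOTS 4              /* device buffers available        */
#define BLOCK_BYTES (1024 * 1024)  /* size of one cached host block   */

typedef struct {
    const void *host_block;        /* host address cached here (tag)  */
    void       *dev_buf;           /* device buffer backing the slot  */
    long        last_use;          /* timestamp for LRU replacement   */
} cache_slot;

static cache_slot slots[CACHE_SLOTS];
static long clock_ticks = 0;

/* Return a device-resident copy of host_block, paying a PCIExpress
   transfer only on a cache miss; hits reuse the resident copy. */
static void *dev_cache_get(const void *host_block)
{
    int i, victim = 0;
    for (i = 0; i < CACHE_SLOTS; i++) {
        if (slots[i].host_block == host_block) {   /* hit: no transfer */
            slots[i].last_use = ++clock_ticks;
            return slots[i].dev_buf;
        }
        if (slots[i].last_use < slots[victim].last_use)
            victim = i;                            /* least recently used */
    }
    if (slots[victim].dev_buf == NULL)             /* lazily allocate slot */
        cudaMalloc(&slots[victim].dev_buf, BLOCK_BYTES);
    cudaMemcpy(slots[victim].dev_buf, host_block, BLOCK_BYTES,
               cudaMemcpyHostToDevice);            /* miss: stage block */
    slots[victim].host_block = host_block;
    slots[victim].last_use   = ++clock_ticks;
    return slots[victim].dev_buf;
}

int main(void)
{
    static char block[BLOCK_BYTES];       /* one host-side data block   */
    void *d1 = dev_cache_get(block);      /* miss: transfers the block  */
    void *d2 = dev_cache_get(block);      /* hit: no PCIExpress traffic */
    printf("device buffer reused on hit: %s\n", d1 == d2 ? "yes" : "no");
    return 0;
}

With such a cache, repeated references to the same block touch the bus only once, while the solvable problem size is bounded by the host memory rather than by the device memory.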

Figure 6.17 shows the performance of the accelerated version of the Cholesky factorization in PLAPACK executed on 32 GPUs of LONGHORN. The results illustrate the difference in performance between a configuration in which one GPU per node is used (for a total of 32 nodes) and a configuration in which two GPUs per node are used for the calculations (for a total of 16 nodes). In the latter case, the penalty introduced by the shared usage of the PCIExpress bus in each node is less than 8% for the largest matrices. Although the difference in performance is non-negligible, the multi-GPU configuration delivers attractive performance, and thus the trade-off between acquisition costs and raw performance must also be taken into account.

6.7. Conclusions

In previous chapters, we have demonstrated how multi-GPU systems can be an appealing solution to implement high-performance dense linear algebra routines. However, given the bottleneck introduced by the PCIExpress bus as the number of GPUs sharing it increases, clusters of GPUs with a reduced number of GPUs per node are the natural evolution towards high-performance, GPU-based, large-scale systems.

Porting existing distributed-memory codes to hybrid GPU-CPU clusters may be a challenging task. We have presented an approach to mechanically port the routines of the dense linear algebra message-passing library PLAPACK to a hybrid cluster consisting of nodes equipped with hardware accelerators. By initially placing all data in the memory of the accelerators, the number of PCIExpress transfers between the memories of host and device is reduced and performance is improved. All data transfers are embedded inside PLAPACK communication (copy) and consolidation (reduce) routines, so that retargeting the library routines is mostly automatic and transparent to the user.
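The following is a hedged sketch of how a copy and a reduce between the device memories of different nodes might embed the host-device transfers, so that the calling routine sees only device buffers. The function names (device_copy, device_reduce), the staging through host memory, and the double-precision sum are illustrative assumptions, not the actual PLAPACK internals.

/* Sketch only: names and staging strategy are assumptions. */
#include <stdlib.h>
#include <mpi.h>
#include <cuda_runtime.h>

/* Copy `bytes` of device memory from rank `src` to rank `dst`; every
   rank calls this with its own device buffer. */
void device_copy(void *dev_buf, size_t bytes, int src, int dst,
                 MPI_Comm comm)
{
    int rank;
    MPI_Comm_rank(comm, &rank);
    if (rank == src) {
        void *stage = malloc(bytes);                    /* host staging */
        cudaMemcpy(stage, dev_buf, bytes, cudaMemcpyDeviceToHost);
        MPI_Send(stage, (int)bytes, MPI_BYTE, dst, 0, comm);
        free(stage);
    } else if (rank == dst) {
        void *stage = malloc(bytes);
        MPI_Recv(stage, (int)bytes, MPI_BYTE, src, 0, comm,
                 MPI_STATUS_IGNORE);
        cudaMemcpy(dev_buf, stage, bytes, cudaMemcpyHostToDevice);
        free(stage);
    }
}

/* Consolidation: element-wise sum of every rank's device buffer into
   the device memory of rank `root`. */
void device_reduce(double *dev_buf, int count, int root, MPI_Comm comm)
{
    int rank;
    MPI_Comm_rank(comm, &rank);
    double *stage = malloc(count * sizeof(double));
    cudaMemcpy(stage, dev_buf, count * sizeof(double),
               cudaMemcpyDeviceToHost);                /* device -> host */
    if (rank == root) {
        MPI_Reduce(MPI_IN_PLACE, stage, count, MPI_DOUBLE, MPI_SUM,
                   root, comm);                        /* root accumulates */
        cudaMemcpy(dev_buf, stage, count * sizeof(double),
                   cudaMemcpyHostToDevice);            /* host -> device */
    } else {
        MPI_Reduce(stage, NULL, count, MPI_DOUBLE, MPI_SUM, root, comm);
    }
    free(stage);
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    double *d;
    cudaMalloc((void **)&d, 4 * sizeof(double));
    cudaMemset(d, 0, 4 * sizeof(double));
    device_reduce(d, 4, 0, MPI_COMM_WORLD);    /* consolidate on rank 0 */
    cudaFree(d);
    MPI_Finalize();
    return 0;
}

Because both directions of the bus traffic live inside these two routines, the rest of the code manipulates device buffers exactly as the original PLAPACK routines manipulate host buffers.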

The experimental results have demonstrated that the integration of GPUs in the nodes of a cluster is an efficient, inexpensive, and scalable solution for the acceleration of large dense linear algebra problems. Furthermore, PLAPACK has also demonstrated its portability to novel architectures. From the perspective of the user, with the adaptation of the library to clusters of GPUs, the development of GPU-accelerated codes becomes a transparent task.

