- Matrix computations on systems equipped with GPUs
- Introduction
- The evolution of hardware for High Performance Computing
- The programmability issue on novel graphics architectures
- About this document. Motivation and structure
- Motivation and goals
- Structure of the document
- Description of the systems used in the experimental study
- Performance metrics
- Hardware description
- Software description
- The FLAME algorithmic notation
- The architecture of modern graphics processors
- The graphics pipeline
- Programmable pipeline stages
- The Nvidia G80 as an example of the CUDA architecture
- The architecture of modern graphics processors
- General architecture overview. Nvidia Tesla
- Memory subsystem
- The GPU as a part of a hybrid system
- Arithmetic precision. Accuracy and performance
- Present and future of GPU architectures
- Conclusions and implications on GPU computing
- BLAS on single-GPU architectures
- BLAS: Basic Linear Algebra Subprograms
- BLAS levels
- Naming conventions
- Storage schemes
- BLAS on Graphics Processors: NVIDIA CUBLAS
- Evaluation of the performance of NVIDIA CUBLAS
- Improvements in the performance of Level-3 NVIDIA CUBLAS
- gemm-based programming for the Level-3 BLAS
- Systematic development and evaluation of algorithmic variants
- Experimental results
- Impact of the block size
- Performance results for rectangular matrices
- Performance results for double precision data
- Padding
- Conclusions
- LAPACK-level routines on single-GPU architectures
- LAPACK: Linear Algebra PACKage
- LAPACK and BLAS
- Naming conventions
- Storage schemes and arguments
- LAPACK routines and organization
- Cholesky factorization
- Scalar algorithm for the Cholesky factorization
- Blocked algorithm for the Cholesky factorization
- Computing the Cholesky factorization on the GPU
- Basic implementations. Unblocked and blocked versions
- Padding
- Hybrid implementation
- LU factorization
- Scalar algorithm for the LU factorization
- Blocked algorithm for the LU factorization
- LU factorization with partial pivoting
- Computing the LU factorization with partial pivoting on the GPU
- Basic implementations. Unblocked and blocked versions
- Padding and hybrid algorithm
- Reduction to tridiagonal form on the graphics processor
- The symmetric eigenvalue problem
- Reduction to tridiagonal form. The LAPACK approach
- Reduction to tridiagonal form. The SBR approach
- Experimental Results
- Conclusions
- Matrix computations on multi-GPU systems
- Linear algebra computation on multi-GPU systems
- Programming model and runtime. Performance considerations
- Programming model
- Transfer management and spatial assignation
- Experimental results
- Impact of the block size
- Number of data transfers
- Performance and scalability
- Impact of data distribution
- Conclusions
- Matrix computations on clusters of GPUs
- Parallel computing memory architectures
- Shared memory architectures
- Distributed memory and hybrid architectures
- Accelerated hybrid architectures
- Parallel programming models. Message-passing and MPI
- ScaLAPACK
- PLAPACK
- Elemental
- Description of the PLAPACK infrastructure
- Layered approach of PLAPACK
- Usage of the PLAPACK infrastructure. Practical cases
- Porting PLAPACK to clusters of GPUs
- Experimental results
- Conclusions
- Conclusions
- Conclusions and main contributions
- Contributions for systems with one GPU
- Contributions for clusters of GPUs
- Related publications
- Publications directly related with the thesis topics
- Publications indirectly related with the thesis topics
- Other publications
- Open research lines
- FLAME algorithms for the BLAS-3 routines
[Figure 6.17 plot: "Cholesky factorization on LONGHORN"; GFLOPS (0-5000) versus matrix size (0-100000); two curves: 32 Quadro FX5800 on 32 nodes, and 32 Quadro FX5800 on 16 nodes.]
Figure 6.17: Performance of the device-centric implementation of the Cholesky factorization on LONGHORN, using 32 Quadro FX5800 GPUs on 32 nodes (one GPU per node) or on 16 nodes (two GPUs per node).
The advantages of the device-centric approach are clear, especially for large matrices. On the other hand, in the host-centric approach the size of the problem that can be solved is restricted only by the amount of main memory in the system, which is usually larger than the device memory (see the tested sizes for the matrix-matrix multiplication in Figure 6.16). In principle, this limitation of the device-centric approach can be overcome transparently to the programmer by handling the device memory as a cache of the host memory, as proposed in Chapter 5.
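The following C/CUDA fragment is a minimal sketch of this caching idea, not the runtime proposed in Chapter 5: it assumes a fixed number of device-resident slots and a simple round-robin replacement policy, and all identifiers (cache_slot, cache_lookup, CACHE_SLOTS, will_write) are hypothetical.

```c
/* Minimal sketch (not the Chapter 5 runtime) of using the device memory as a
 * software cache of host-resident matrix blocks.  All names are hypothetical. */
#include <cuda_runtime.h>
#include <stddef.h>

#define CACHE_SLOTS 8                 /* blocks kept simultaneously on the GPU */

typedef struct {
    double *host_block;               /* host address that identifies the block */
    double *dev_block;                /* device copy of that block              */
    size_t  bytes;                    /* size of the cached block               */
    int     dirty;                    /* device copy newer than the host copy?  */
} cache_slot;

static cache_slot cache[CACHE_SLOTS];
static int next_victim = 0;           /* trivial round-robin replacement policy */

/* Return a device pointer for a host block; the block crosses the
 * PCIExpress bus only on a cache miss. */
double *cache_lookup(double *host_block, size_t bytes, int will_write)
{
    for (int i = 0; i < CACHE_SLOTS; i++)
        if (cache[i].host_block == host_block) {
            cache[i].dirty |= will_write;
            return cache[i].dev_block;            /* hit: no transfer needed   */
        }

    cache_slot *s = &cache[next_victim];
    next_victim = (next_victim + 1) % CACHE_SLOTS;

    if (s->dev_block != NULL) {                   /* evict the previous tenant */
        if (s->dirty)                             /* write back modified data  */
            cudaMemcpy(s->host_block, s->dev_block, s->bytes,
                       cudaMemcpyDeviceToHost);
        cudaFree(s->dev_block);
    }

    cudaMalloc((void **)&s->dev_block, bytes);    /* bring the new block in    */
    cudaMemcpy(s->dev_block, host_block, bytes, cudaMemcpyHostToDevice);
    s->host_block = host_block;
    s->bytes      = bytes;
    s->dirty      = will_write;
    return s->dev_block;
}
```

With a scheme of this kind, blocks reused by consecutive operations cross the PCIExpress bus only once, while the problem size is limited by the host memory rather than by the device memory.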
Figure 6.17 shows the performance of the accelerated version of the Cholesky factorization in PLAPACK executed on 32 GPUs of LONGHORN. The results illustrate the difference in performance between a configuration that uses one GPU per node (for a total of 32 nodes) and one that uses two GPUs per node (for a total of 16 nodes). In the latter case, the penalty introduced by sharing the PCIExpress bus within each node is below 8% for the largest matrices. Although this difference in performance is non-negligible, the multi-GPU configuration still delivers attractive performance, and thus the trade-off between acquisition cost and raw performance must also be taken into account.
6.7. Conclusions
In previous chapters we have demonstrated that multi-GPU systems are an appealing platform for implementing high-performance dense linear algebra routines. However, given the bottleneck introduced by the PCIExpress bus as the number of GPUs sharing it increases, clusters of GPUs with a reduced number of GPUs per node are the natural evolution towards high-performance, GPU-based large-scale systems.
Porting existing distributed-memory codes to hybrid GPU-CPU clusters may be a challenging task. We have presented an approach to mechanically port the routines of the dense linear algebra message-passing library PLAPACK to a hybrid cluster whose nodes are equipped with hardware accelerators. By initially placing all data in the memory of the accelerators, the number of PCIExpress transfers between the memories of host and device is reduced and performance is improved. All data transfers are embedded inside the PLAPACK communication (copy) and consolidation (reduce) routines, so that retargeting the library routines is mostly automatic and transparent to the user.
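As an illustration of this scheme, the following sketch is not actual PLAPACK code: gpu_copy_block and its arguments are hypothetical, and it only shows where the host-device transfers can be confined inside the body of a point-to-point copy routine so that callers keep working with device-resident data.

```c
/* Hypothetical sketch of hiding host<->device transfers inside a
 * communication (copy) routine.  gpu_copy_block() is not part of PLAPACK. */
#include <cuda_runtime.h>
#include <mpi.h>
#include <stdlib.h>

void gpu_copy_block(const double *dev_src,  /* n x n block in the sender's GPU   */
                    double       *dev_dst,  /* n x n block in the receiver's GPU */
                    int n, int src_rank, int dst_rank, MPI_Comm comm)
{
    int rank;
    MPI_Comm_rank(comm, &rank);
    size_t bytes = (size_t)n * n * sizeof(double);

    if (rank == src_rank) {
        /* Stage the block in a host buffer before the MPI transfer. */
        double *host_buf = (double *)malloc(bytes);
        cudaMemcpy(host_buf, dev_src, bytes, cudaMemcpyDeviceToHost);
        MPI_Send(host_buf, n * n, MPI_DOUBLE, dst_rank, 0, comm);
        free(host_buf);
    } else if (rank == dst_rank) {
        /* Receive into a host buffer and push the block back to the GPU. */
        double *host_buf = (double *)malloc(bytes);
        MPI_Recv(host_buf, n * n, MPI_DOUBLE, src_rank, 0, comm,
                 MPI_STATUS_IGNORE);
        cudaMemcpy(dev_dst, host_buf, bytes, cudaMemcpyHostToDevice);
        free(host_buf);
    }
}
```

With this organization, only the data that is actually communicated between nodes crosses the PCIExpress bus, and both the library layers above the copy routine and the user code continue to operate exclusively on device-resident objects.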
The experimental results demonstrate that the integration of GPUs in the nodes of a cluster is an efficient, inexpensive and scalable solution for the acceleration of large dense linear algebra problems. Furthermore, PLAPACK has also demonstrated its portability to novel architectures. From the user's perspective, once the library has been adapted to clusters of GPUs, the development of GPU-accelerated codes becomes a transparent task.