
4.7. REDUCTION TO TRIDIAGONAL FORM ON THE GRAPHICS PROCESSOR

The construction of Q_B that produces the reduction to band form can also be easily performed on the GPU, as this computation basically consists of calls to BLAS-3 kernels.

Reduction to tridiagonal form

Routine SBRDT in the SBR toolbox is responsible for reducing the banded matrix B to tridiagonal form by means of Householder reflectors. Let Q_T denote the orthogonal transforms which yield this reduction, that is, Q_T^T B Q_T = T. On exit, the routine returns the tridiagonal matrix T and, upon request, accumulates these transforms, forming the matrix Q = Q_B Q_T ∈ R^{n×n} so that

Q^T A Q = Q_T^T (Q_B^T A Q_B) Q_T = Q_T^T B Q_T = T.

The matrix T is constructed in routine SBRDT one column at a time: at each iteration, the elements below the first sub-diagonal of the current column are annihilated using a Householder reflector; the reflector is then applied to both sides of the matrix, resulting in a bulge which has to be chased down along the band. The computation is cast in terms of BLAS-2 operations at best (SYMV and SYR2 for the two-sided updates, and GEMV and GER for the one-sided updates) and the total cost is 6n^2 w + 8n w^2 flops.
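To make the column-by-column process concrete, the short NumPy sketch below (an illustration of ours, not code from the SBR toolbox) annihilates the entries below the first sub-diagonal of one column of a dense symmetric matrix with a Householder reflector and applies the reflector from both sides; the band storage and the bulge-chasing sweep that follows each such update are omitted.

import numpy as np

def annihilate_column(A, j):
    """Zero out A[j+2:, j] (and, by symmetry, A[j, j+2:]) with a Householder reflector."""
    x = A[j+1:, j].copy()                       # entries below the diagonal of column j
    v = x.copy()
    v[0] += (1.0 if x[0] >= 0 else -1.0) * np.linalg.norm(x)
    nrm = np.linalg.norm(v)
    if nrm == 0.0:                              # nothing to annihilate
        return A
    v /= nrm
    H = np.eye(len(x)) - 2.0 * np.outer(v, v)   # reflector acting on rows/columns j+1:
    Q = np.eye(A.shape[0])
    Q[j+1:, j+1:] = H
    return Q.T @ A @ Q                          # two-sided update (SYMV/SYR2-like work)

n = 6
A = np.random.rand(n, n); A = (A + A.T) / 2    # random symmetric test matrix
A1 = annihilate_column(A, 0)
print(np.round(A1[:, 0], 8))                    # only the first two entries remain nonzero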

Let Q denote the product of all the Householder reflectors needed to obtain T from B. These transforms are usually accumulated by applying the corresponding sequence of reflectors to the orthogonal matrix that results from the reduction of A to the band matrix B. These transformations can be applied in a blocked fashion by exploiting the WY representation [70]. The accumulation of Q is therefore entirely performed by the fast BLAS-3 GEMM kernel.
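The sketch below illustrates this idea using the classical (non-compact) WY form I + W Y^T, rather than the compact form used internally by LAPACK: a block of b reflectors is applied to an already accumulated orthogonal matrix with just two GEMM-shaped products. The helper wy_from_reflectors is our own illustration, not an SBR or LAPACK routine.

import numpy as np

def wy_from_reflectors(V):
    """Build W, Y such that H_1 H_2 ... H_b = I + W Y^T for unit reflector vectors V[:, i]."""
    n, b = V.shape
    W = np.zeros((n, b)); Y = np.zeros((n, b))
    for i in range(b):
        v = V[:, i]
        z = -2.0 * (v + W @ (Y.T @ v))          # standard WY update rule
        W[:, i] = z; Y[:, i] = v
    return W, Y

n, b = 8, 3
V = np.linalg.qr(np.random.rand(n, b))[0]       # b unit vectors defining the reflectors
W, Y = wy_from_reflectors(V)

Qacc = np.linalg.qr(np.random.rand(n, n))[0]    # previously accumulated orthogonal factor
Qnew = Qacc + (Qacc @ W) @ Y.T                  # apply the whole block with two GEMMs

Qref = Qacc.copy()                              # reference: apply the reflectors one by one
for i in range(b):
    v = V[:, i:i+1]
    Qref = Qref @ (np.eye(n) - 2.0 * v @ v.T)
print(np.allclose(Qnew, Qref))                  # True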

If the eigenvectors are desired, then the orthogonal matrix Q_B produced in the first step (reduction from full to banded form) needs to be updated with the orthogonal transforms computed during the reduction from banded to tridiagonal form, building Q_B Q_T. This accumulation requires O(n^3) flops and can be reformulated almost entirely in terms of calls to level-3 BLAS kernels, even though this reformulation is less trivial than for the first step [32]. However, the matrix Q = Q_B Q_T still needs to be applied as part of the back-transform stage, to obtain X := Q X_T, adding 2n^3 flops to the cost of building the matrix containing the eigenvectors of A.

We do not propose to off-load the reduction of the banded matrix to tridiagonal form to the GPU, as this is a fine-grained computation which does not lend itself to an easy implementation on this architecture. However, the accumulation of the orthogonal factor produced by this step and the back-transform merely require operations akin to the matrix-matrix product, which can be efficiently computed on the GPU.
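Since both the accumulation Q := Q_B Q_T and the back-transform X := Q X_T are plain matrix-matrix products, they map directly onto GEMM calls on the GPU. The sketch below illustrates this mapping using CuPy as a convenient stand-in for the CUBLAS-based kernels employed in this chapter; the matrices are random placeholders.

import numpy as np
import cupy as cp

n = 2048
Q_B = np.linalg.qr(np.random.rand(n, n))[0]     # factor from the full -> band reduction
Q_T = np.linalg.qr(np.random.rand(n, n))[0]     # factor from the band -> tridiagonal reduction
X_T = np.random.rand(n, n)                      # eigenvectors of the tridiagonal matrix T

dQ_B, dQ_T, dX_T = map(cp.asarray, (Q_B, Q_T, X_T))   # host -> device transfers
dQ = dQ_B @ dQ_T                                # accumulation Q := Q_B Q_T (GEMM on the GPU)
dX = dQ @ dX_T                                  # back-transform X := Q X_T (GEMM on the GPU)
X = cp.asnumpy(dX)                              # eigenvectors of A, back on the host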

4.7.4. Experimental Results

Evaluating the performance of the routines for the reduction to tridiagonal form is an elaborate task, due to the large number of factors that have an influence on it. Among them, we will focus on block size, bandwidth for the SBR toolbox, and number of cores. In the following experiments we aim at determining the optimal configuration of these parameters, before proceeding to show a full comparison of the two approaches. For simplicity, we only report results for four problem sizes: 2,048, 6,144, 10,240 and 24,576, and the best bandwidths (w) and block sizes (b) detected in each case, though a complete experiment was carried out for many other values.

Building BLAS-2 and BLAS-3 kernels

We start by analyzing the performance of the low-level kernels involved in the reduction to condensed forms (tridiagonal and banded for the LAPACK and SBR approaches, respectively). For the LAPACK routine SYTRD, these are the symmetric matrix-vector product (SYMV) and the symmetric rank-2k update (SYR2K). For the SBR routine SYRDB, the kernels are the symmetric matrix-matrix product (SYMM) and SYR2K. Table 4.3 reports results on 1, 4 and 8 cores of the Intel Xeon Nehalem processors in PECO. For the BLAS-3 kernels (SYMM and SYR2K), we also report the performance on the single GPU in this platform using the implementation in CUBLAS as well as our own implementations, which cast most computations in terms of the general matrix-matrix product [83] (column labeled as “Our BLAS”). The matrix dimensions of SYR2K and SYMM are chosen so that they match the shape of the blocks encountered during the reduction to condensed forms (n is the problem size, while k plays the role of the block size b for SYTRD and that of the bandwidth w for SYRDB). For reference, we also include the performance of the general matrix-vector product (GEMV) and matrix-matrix product (GEMM) kernels.
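For reference, GFLOPS figures such as those in Table 4.3 are obtained by dividing a flop count for the kernel by the measured execution time. The sketch below illustrates this with NumPy expressions standing in for the BLAS calls; the flop counts used here are the standard ones for these shapes and may differ slightly from the exact counting convention behind the table.

import time
import numpy as np

def gflops(flops, seconds):
    return flops / seconds / 1e9

n, k = 2048, 64
A = np.random.rand(n, k); B = np.random.rand(n, k); C = np.random.rand(n, n)

t0 = time.perf_counter()
C += A @ B.T + B @ A.T          # SYR2K-shaped update (4n^2 k flops when symmetry is not exploited)
t1 = time.perf_counter()
print("SYR2K-like:", gflops(4 * n * n * k, t1 - t0), "GFLOPS")

t0 = time.perf_counter()
D = C @ A                       # SYMM/GEMM-shaped product, roughly 2n^2 k flops
t1 = time.perf_counter()
print("GEMM-like: ", gflops(2 * n * n * k, t1 - t0), "GFLOPS")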

The performance of SYMV increases with the number of cores and is significantly higher than that of GEMV. When the problem size is n = 2,048, the matrix fits into the L3 caches of PECO (8 MB), which explains the much higher GFLOPS rate of SYMV. The same does not hold for GEMV, as the whole matrix needs to be stored, and not just its lower or upper triangle. We will consider these numbers as the basis for our investigation; in other words, justifying the figures yielded by the particular multi-threaded BLAS implementation of these kernels is not the purpose of this dissertation.

The two BLAS-3 kernels involved in the computation increase their GFLOPS rate with the number of cores on PECO. This behavior plays a key role in the global performance of the whole reduction process on this architecture.

The results also illustrate the higher performance delivered by the GPU for most BLAS-3 kernels. Although our own implementations of the symmetric BLAS-3 kernels for the GPU deliver higher GFLOPS rates than those from CUBLAS, they still fall well below the performance of the matrix-matrix product kernel in CUBLAS. However, in all cases the performance attained by the GPU is higher than that delivered by the multi-core CPU. The improvement in the performance of the SYMM kernel is of special relevance for the reduction to tridiagonal form using the SBR routines.

The LAPACK approach

We next analyze the gains of a multi-core execution of the LAPACK routines SYTRD and, in case Q X_T is required, ORMTR. Table 4.4 reports the execution time (in seconds) for different values of the problem size, block size (b) and number of cores.

Let us comment first on the routine that performs the reduction from full to tridiagonal form, SYTRD. The block size plays no significant role in its performance. Increasing the number of cores reduces the execution time on PECO, but with moderate speed-ups; for example, 3× and 4.2× speed-ups are attained for the largest problem size using, respectively, 4 and 8 cores.

The accumulation of the orthogonal transformations via routine ORMTR requires less time and is, in general, more efficient than the reduction stage. This is to be expected, as most of the computations in the accumulation are cast in terms of BLAS-3 kernels (GEMM), while only half of those in SYTRD are BLAS-3. Representative speed-ups of routine ORMTR on PECO are 3.7× and 6.6×, attained respectively with 4 and 8 cores for the largest problem size. Note also that this routine benefits from block sizes (e.g., 192 and 256) much larger than those that were optimal for SYTRD.

The SBR approach

We next study the parallelism of the two-step SBR approach: reduction of a general matrix to band form (SYRDB) and subsequent reduction to tridiagonal form (SBRDT).


SYMV. y := Ax + y  (A ∈ R^{n×n} symmetric; x, y ∈ R^n)          GEMV. y := Ax + y  (A ∈ R^{n×n}; x, y ∈ R^n)

     n    1 core  4 cores  8 cores  |  1 core  4 cores  8 cores
  2048      8.26     34.5     60.0  |    5.15     8.47     8.31
  6144      7.80     17.7     21.6  |    5.07     9.13     10.7
 10240      6.69     18.3     22.1  |    4.70     9.26     11.2
 24576      5.75     16.0     21.0  |    3.16     8.45     10.8

SYR2K. C := AB^T + BA^T + C  (A, B ∈ R^{n×k}; C ∈ R^{n×n} symmetric)          GEMM. C := AB^T + C  (A, B ∈ R^{n×k}; C ∈ R^{n×n})

     n    k    1 core  4 cores  8 cores  CUBLAS  Our BLAS  |  1 core  4 cores  8 cores  CUBLAS
  2048   32      14.0     55.4     91.0    53.2      53.2  |    15.0     58.6    116.4   157.2
         64      15.5     62.5    116.3    74.4     159.2  |    16.7     65.9    129.3   185.5
         96      16.5     65.3    122.2    78.0     162.9  |    17.3     68.4    135.9   192.2
  6144   32      13.6     50.3     89.6    55.9      56.0  |    15.0     59.6    106.9   161.0
         64      15.7     59.8    112.3    78.4     124.2  |    16.7     66.6    123.1   185.0
         96      16.8     64.4    122.6    81.8     126.3  |    17.4     69.2    137.8   195.1
 10240   32      13.8     51.2     83.9    56.4      56.4  |    15.0     59.6    116.5   159.3
         64      15.8     60.9    113.8    79.2     114.2  |    16.7     66.7    130.6   182.2
         96      16.9     65.3    123.7    78.1     116.7  |    17.5     69.4    137.7   187.3
 24576   32      13.9     51.0     89.9    57.4      53.4  |    14.7     58.3    108.9   156.0
         64      16.0     60.9    113.6    79.1     116.9  |    16.6     64.6    129.9   186.2
         96      16.5     65.1    123.9    83.4     112.2  |    17.3     68.8    137.5   189.9

SYMM. C := AB + C  (A ∈ R^{n×n} symmetric; B, C ∈ R^{n×k})          GEMM. C := AB + C  (A ∈ R^{n×n}; B, C ∈ R^{n×k})

     n    k    1 core  4 cores  8 cores  CUBLAS  Our BLAS  |  1 core  4 cores  8 cores  CUBLAS
  2048   32      12.2     29.7     46.9    89.7     106.5  |    15.0     59.9     99.2   177.5
         64      14.7     45.2     75.4    97.1     183.4  |    16.7     66.2    105.3   279.0
         96      15.7     49.3     80.4    97.9     189.4  |    17.4     68.7    133.5   290.1
  6144   32      11.9     28.5     42.9    94.1     129.6  |    15.1     59.5    116.7   327.5
         64      14.5     43.9     71.8    99.1     188.4  |    16.8     66.5    132.2   339.3
         96      15.6     48.5     77.5   100.4     198.1  |    17.5     69.3    132.2   338.2
 10240   32      11.1     25.3     39.4    76.0     113.5  |    15.0     59.5    116.7   321.9
         64      13.9     40.5     66.6    76.5     175.8  |    16.8     66.6    132.4   346.9
         96      15.2     45.3     73.4    77.5     180.0  |    17.4     69.4    135.8   348.0
 20480   32      10.8     24.8     37.9    77.8     110.0  |    14.7     58.3    108.4   328.0
         64      13.5     39.3     63.3    66.8     176.7  |    16.6     66.0    131.3   344.9
         96      15.0     44.6     65.7    65.9     179.0  |    17.3     68.9    135.3   346.0

Table 4.3: Performance (in GFLOPS) of the BLAS kernels SYMV (top), SYR2K (middle) and SYMM (bottom) and the corresponding matrix-vector and matrix-matrix products (for reference) on PECO. Peak performances for 1, 4 and 8 cores of this platform are 18.2, 72.6 and 145.3 GFLOPS, respectively.


                 SYTRD: Full → Tridiagonal             ORMTR: Accum. Q X_T

     n     b    1 core  4 cores  8 cores  |    b    1 core  4 cores  8 cores
  2048    32      1.11     0.34     0.23  |  128      1.22     0.41     0.27
          64      1.11     0.35     0.23  |  192      1.23     0.41     0.28
          96      1.11     0.36     0.25  |  256      1.27     0.41     0.27
  6144    32      40.4     11.4      9.2  |  128      28.8     8.60     5.20
          64      39.9     11.3      8.4  |  192      28.2     8.32     5.09
          96      40.1     11.3     10.4  |  256      29.0     8.46     5.08
 10240    32     156.3     52.4     40.6  |  128     128.5     36.6     21.1
          64     152.5     51.9     40.5  |  192     127.4     35.9     21.1
          96     152.6     52.1     40.9  |  256     126.4     35.3     21.9
 24576    32    2522.1    812.5    590.3  |  128    1767.1    488.1    275.2
          64    2453.6    796.8    600.9  |  192    1732.0    471.5    272.0
          96    2444.4    795.0    582.4  |  256    1720.0    466.0    262.7

Table 4.4: Execution time (in seconds) for the LAPACK routines on PECO.

We also include in the analysis the routines that construct the orthogonal factor Q (SYGTR to build Q_B and SBRDT to accumulate Q := Q_B Q_T) and compute Q X_T (GEMM) in the back-transform. Remember that, while the computational cost of the first step is inversely proportional to the bandwidth w, the cost of the second step is directly proportional to it. In other words, a larger bandwidth requires a smaller number of computations for the first step, transferring a larger part of the flops to the second step.

Table 4.5 displays results for these experiments. For the discussion, consider the following five cases:

1. Reduction of a dense matrix to banded form (Step 1). On the multi-core processor, using a larger number of cores or a larger bandwidth results in a smaller execution time. The execution time on the GPU, on the other hand, is quite independent of w and outperforms the Intel-based architectures for all problem sizes except n = 2,048.

2. Reduction of the banded matrix to tridiagonal form (Step 2). Using more than a single core yields no gain. As expected, a larger value of w translates into a longer execution time for this step.

3. Building the orthogonal factor resulting from Case 1 (Step 1). On the Intel cores, the execution time and parallelism of routine SYGTR are quite similar to those of SYRDB discussed in Case 1. Compared with the results obtained on 8 cores of PECO, the GPU in this platform accelerates the execution by a considerable factor, between 2.5× and 3×.

4. Accumulating the orthogonal transforms corresponding to Case 2 (Step 2). This is by far the most expensive operation of the five cases on the Intel cores, though it exhibits a certain degree of parallelism, which helps in reducing its weight on the overall process. The speed-up attained by the GPU for the larger problem dimensions is impressive.

5. Back-transform. The cost of this operation is comparable to that of Case 3. The best result is always attained with 8 cores, and the GPU yields a notable acceleration.

Note that such a study needs to take into account that the choice of bandwidth cannot be made independently of other factors. Therefore, we delay further comments on the data in the previous tables to the next subsection, where we elaborate on the optimal combination of the factors that determine the overall performance of this approach.


           1st step (SYRDB): Full → Band             2nd step (SBRDT): Band → Tridiagonal

     n    w    1 core  4 cores  8 cores     GPU  |  1 core  4 cores  8 cores
  2048   32      0.89     0.34     0.23    0.21  |    0.37     1.64     1.72
         64      0.81     0.28     0.19    0.20  |    0.45     1.08     1.03
         96      0.80     0.27     0.19    0.22  |    0.57     0.90     0.91
  6144   32      23.6      8.3      5.2    2.78  |    3.48    14.93     17.1
         64      20.8      6.2      3.7    2.27  |    4.88     9.92     10.1
         96      19.9      5.9      3.6    2.29  |    5.42     8.23     8.91
 10240   32     112.6     41.1     26.7   10.81  |    9.51     41.1     43.3
         64      95.9     29.6     18.7    9.72  |    11.7     27.5     26.3
         96      90.9     27.3     16.1   10.39  |    15.1     23.1     25.3
 24576   32    1589.9    569.0    354.3   112.6  |    54.2    237.3    258.0
         64    1330.2    404.3    235.5    99.3  |    72.9    159.2    157.7
         96    1251.2    370.7    220.5   105.3  |    96.8    133.3    140.3

           1st step (SYGTR): Build Q_B          2nd step (SBRDT): Accum. Q := Q_B Q_T      Back-transform (GEMM): Comp. Q X_T

     n    w    1 core  4 cores  8 cores    GPU  |  1 core  4 cores  8 cores     GPU  |  8 cores     GPU
  2048   32      0.81     0.33     0.25   0.07  |    2.31     1.28     1.38    0.76  |
         64      0.73     0.26     0.20   0.04  |    1.86     0.83     0.55    0.42  |     0.12    0.07
         96      0.70     0.25     0.19   0.03  |    1.61     0.54     0.36    0.26  |
  6144   32      21.2     7.35     7.04   1.68  |    65.4     33.0     35.7    6.24  |
         64      19.0     5.83     3.86   1.77  |    51.8     22.1     14.5    3.09  |     3.36    1.50
         96      18.3     5.62     3.67   1.75  |    44.3     14.5      9.5    1.74  |
 10240   32      97.5     32.5     22.7   6.81  |   291.0    150.8    163.4    32.4  |
         64      87.5     25.6     17.2   6.44  |   235.1    102.3     66.6    12.6  |     15.0    6.46
         96      84.1     24.7     16.5   5.61  |   203.1     67.2     43.9    6.27  |
 24576   32    1399.0    456.8    310.7   94.6  |  4149.5   2166.7   2403.4   101.8  |
         64    1232.0    353.0    217.3   88.0  |  3390.2   1465.0    956.6    55.3  |    209.5    89.3
         96    1177.0    377.7    207.6   81.0  |  2898.4    969.9    638.8    30.9  |

Table 4.5: Execution time (in seconds) for the SBR routines on PECO. The back-transform does not depend on w and is therefore reported only once per problem size.
