
consumption. Also, due to the design of the serial protocol connecting the PDU to the Gateway, a measurement takes at least 215 ms. A delay of approximately one second was observed between a measurement and the event that triggers it.

The CPU data file consists of time-stamped events that can be cross-referenced with the PDU data file to extract the current values corresponding to each segment of execution. Integrating these current values over the corresponding timestamps yields the measured power consumption data of the parallel algorithm.
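As an illustration of this crossing step, the following C sketch (our own, not the authors' tool; all names are hypothetical) integrates the PDU current samples over one time-stamped segment taken from the CPU data file:

    #include <stddef.h>

    typedef struct { double t; double amps; } pdu_sample;  /* one PDU reading */

    /* Integrates current (trapezoidal rule) between t0 and t1, the bounds of
       one execution segment taken from the CPU event file. Samples must be
       sorted by time and cover [t0, t1]. The result is charge in A·s, the
       proxy for power consumption used throughout the paper. */
    double segment_charge(const pdu_sample *s, size_t n, double t0, double t1)
    {
        double acc = 0.0;
        for (size_t i = 0; i + 1 < n; ++i) {
            double a = s[i].t   > t0 ? s[i].t   : t0;  /* clip the sample   */
            double b = s[i+1].t < t1 ? s[i+1].t : t1;  /* interval to [t0,t1] */
            if (b > a)
                acc += 0.5 * (s[i].amps + s[i+1].amps) * (b - a);
        }
        return acc;
    }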

4 Power performance model

In this section we propose an analytical power performance model, similar to the execution time model introduced in Section 2, for a master–slave application. We will use the well-known matrix-multiplication code to illustrate the proposed model. First, we study the power-aware behavior of three different sequential matrix-multiplication implementations. Then, we study the communication part of the master–slave application in terms of power consumption.

4.1 Master-slave matrix-multiplication parallel implementation

To implement the transposed variant of the matrix multiplication we used a master-slave scheme, with one master and p slaves, where the master process does not compute. Although each node had two processors, only one was used. The main experimentation script covered sizes from 1024 to 6400, and every multiplication was performed 10 times. To minimize effects related to the order of execution, this order was randomized. Figure 4 shows the overall current profile for a 6400 by 6400 multiplication. The study of the resulting 260 pairs of files with PDU and CPU data allowed us to derive some facts.

The current remained constant during the broadcast, send and receive code segments, but its level was higher than that of the idle state.

The current of the master process remained constant at roughly the same level while the slaves performed the computation. We did not find any difference between this level and the one related to the communication segments.

During the computation segments the current reached its maximum and, again, remained constant throughout the execution.

With all this data available we could estimate the total power consumed by the execution. Taking into account that power consumption is closely related to time, it makes sense to begin with the time model for the particular problem we are studying.


[Figure: PDU current data per node, Size = 6400. Five stacked panels, one per node (slave 4, slave 3, slave 2, slave 1, master), each plotting current (A) from 0.00 to 0.30 against time (s) from 0 to 600.]

Figure 4: PDU overall current data, matrix–multiplication on 4 processors

From the fragment of code below, corresponding to the core of the algorithm, it can be derived that the execution time depends not only on the number of floating-point operations, 2n³, but also on the 2n³ read accesses to memory and the n² writes to memory.

    for (i = 0; i < size; ++i) {
        for (j = 0; j < size; ++j) {
            sum = 0.0;
            for (k = 0; k < size; ++k) {
                /* B holds the transposed operand, so both matrices are
                   traversed row-wise (the transposed variant above) */
                sum += A[i*size + k] * B[j*size + k];
            }
            C[i*size + j] = sum;
        }
    }

To check whether the writes to memory affected the estimated power, a regression with positive coefficients was performed using R [11] and the package nnls [9]. It turned out that the write operation did not contribute to the power in this segment. The nonzero coefficient obtained, FlopMem, includes the contribution of the two floating-point operations and the two reads from memory. Thus the estimated time for the sequential case is simply:

Tcomp = 2n³ · FlopMem    (4)


 

[Figure: per-process timeline of power segments (vertical axis: Proc., horizontal label: Power). Master row: Bcast, Snd, Snd, Snd, Snd, Wait, Rcv, Rcv, Rcv, Rcv. Slave rows #1-#4: Bcast, then Wait, Rcv, Comp, Snd, Wait, staggered according to the order in which the master serves them.]

Figure 5: Master-slave diagram of power consumption for matrix multiplication

The total power of the sequential code is obtained using the average current measured during the execution.

PWcomp = Tcomp · Currcomp    (5)

4.2 Model for power consumption

Following the performance model introduced in Section 2, we propose a model of power consumption based on equation 1. Each term of this equation contributes a fraction of the total power consumption, depending on the average current measured in each segment and its duration. Figure 5 depicts again the master-slave schema, this time showing the power consumption. Note that the contribution of the master node is lower than that of the computing nodes (it runs in a lower p-state).

For the bcast operation in eq. 1, an average of the current used during the segment is needed; this value was obtained from the batch of executions and is the same across the processors involved in the operation. The power consumption associated with the broadcast can be computed using:

PWbcast = (p + 1) · Tbcast · Currbcast    (6)

where Tbcast is modeled by eq. 2.

The power estimation for the send-by-blocks segment is similar to that of the broadcast. With the time needed to send a sub-block of matrix A, the estimated power for the whole segment can be calculated. A summation expression is used for the sake of completeness and accuracy, although the total contribution of this part is very small.

PWsnd = ( Σ_{i=1}^{p} i + p ) · Tsnd · Currsnd/rcv    (7)

 


where Tsnd is modeled by eq. 3.

The expression for the computation segment includes the p computing nodes and the contribution of the master process while it waits for the slaves to finish the computation work:

PWcomp = p · Tcomp · Currcomp + PWMasterWait    (8)

PWMasterWait = [Tcomp − (p − 1) · Tsnd] · Currwait    (9)

The last segment of this master-slave schema consists of the master receiving the results from the slaves. The expression is identical to that of the send segment:

PWrcv = ( Σ_{i=1}^{p} i + p ) · Trcv · Currsnd/rcv    (10)

 

Finally, the complete expression for the power of the parallel execution is:

PWtotal = PWbcast + PWsnd + PWcomp + PWrcv    (11)

Our analysis of the power consumption model must end with a note on units. We have been using time · Curr as a proxy for energy, but this has to be corrected to be dimensionally right. For convenience we have omitted multiplying the current by the voltage, which would give V · A. With the appropriate conversion we finally arrive at the value of apparent power in W · h. We have considered the voltage constant during the executions, averaging the voltage of the outlets of the PDU.
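As a quick worked example of the conversion (using a hypothetical nominal outlet voltage of 230 V purely for illustration, since the averaged voltage is not reported here): a segment accounting for 714.54 A·s would correspond to 714.54 · 230 / 3600 ≈ 45.7 W·h of apparent energy.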

5 Model Validation

The model we have developed depends on architectural and algorithmic parameters that can be measured with the appropriate tests. We show between brackets the values measured in our configuration.

βbcast [5e-06 s] and τbcast [4.00641e-09 s], which characterize the time the network takes to broadcast a large message.

βsnd/rcv [0.0009130886 s] and τsnd/rcv [1.879013e-08 s], which characterize the time a node takes to send a large message to another node.

FlopMem [1.879013e-08 s], which essentially characterizes the computing power of every node and can be obtained with a simple timed for-loop.

Currbcast = Currsnd/rcv = Currcomm = Currwait [0.2248810 A above idle current], the current level at which the communication operations are performed.

Currcomp [0.2921429 A above idle current], which characterizes the maximum current per node when one processor is being used at full computation rate.
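As a sanity check, the model of Section 4.2 can be evaluated directly from these parameters. The following C sketch (our own illustration, not the authors' code; the helper and variable names are hypothetical) combines eqs. (6)-(11) as reconstructed above, taking the per-segment times as inputs and the measured currents as constants:

    /* Currents measured in our configuration (A above idle), Section 5 */
    static const double CURR_COMM = 0.2248810;  /* bcast = snd/rcv = wait */
    static const double CURR_COMP = 0.2921429;  /* full computation rate  */

    /* Total modeled power in A·s for p slaves, given the segment times
       Tbcast, Tsnd, Trcv (eqs. 2-3) and the per-slave computation time. */
    double pw_total(int p, double t_bcast, double t_snd,
                    double t_rcv, double t_comp)
    {
        double sum_i    = p * (p + 1) / 2.0;                      /* Σ_{i=1}^{p} i */
        double pw_bcast = (p + 1) * t_bcast * CURR_COMM;          /* eq. (6)  */
        double pw_snd   = (sum_i + p) * t_snd * CURR_COMM;        /* eq. (7)  */
        double pw_wait  = (t_comp - (p - 1) * t_snd) * CURR_COMM; /* eq. (9)  */
        double pw_comp  = p * t_comp * CURR_COMP + pw_wait;       /* eq. (8)  */
        double pw_rcv   = (sum_i + p) * t_rcv * CURR_COMM;        /* eq. (10) */
        return pw_bcast + pw_snd + pw_comp + pw_rcv;              /* eq. (11) */
    }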


 

 

                  Measured                          Modeled                      Error (%)
Size     p=4      p=5      p=6      p=7     p=4     p=5     p=6     p=7     p=4    p=5    p=6    p=7
2000    25.73   25.452   25.338   24.265   26.14   25.29   24.73   24.33    1.59  -0.62  -2.39   0.27
3000    89.33   85.756   85.833   82.651   88.22   85.37   83.47   82.12   -1.25  -0.45  -2.75  -0.65
4000   212.73  201.985  200.7    196.969  209.11  202.36  197.86  194.64   -1.70   0.18  -1.42  -1.18
5000   412.46  393.303  387.977  383.807  408.41  395.23  386.44  380.16   -0.98   0.49  -0.40  -0.95
6000   714.54  670.201  669.880  656.017  705.73  682.96  667.77  656.92   -1.23   1.90  -0.31   0.14

Table 1: Power consumption for matrix–multiplication with 4 to 7 slaves, in A · s

Finally, Table 1 shows measured and modeled values for matrix multiplications with 4 to 7 slaves. The column labeled Error shows the relative error made by the prediction. The low error observed, below 3% in magnitude in all cases, allows us to conclude that our analytical model has been validated and predicts the power consumption.

6 Conclusions

We have analyzed the power consumption of the master-slave paradigm over an MPI application. Similar to the performance model in terms of execution time that can be obtained for this kind of implementation, it is possible to obtain an analytical expression for the energy consumed by these codes while executed on HPC systems. We have implemented a power-metering framework based on standard metered PDUs. The experimental infrastructure allows us to monitor and model any application that can be executed on our cluster. As a case study, we model the matrix-multiplication algorithm by an analytical formula. With this expression we can predict the power consumed by the application on our cluster, knowing the problem size, the number of slaves used and a set of architecture-dependent parameters.

Acknowledgements

This work was supported by the Spanish Ministry of Education and Science through the TIN2011-24598 and TIN2008-06570-C04-03 projects and through the FPU program. It was also supported by the Canarian Agency for Research, Innovation and Information Society under contract ProID20100222, and has been developed in the framework of the European network COST-ICT-0805 and the Spanish network CAPAP-H2.

References

[1] C.-Y. Chou, H.-Y. Chang, S.-T. Wang, and S.-C. Tcheng. Modeling message-passing overhead on the NCHC Formosa PC cluster. In Y.-C. Chung and J. E. Moreira, editors, GPC,


volume 3947 of Lecture Notes in Computer Science, pages 299–307. Springer, 2006.

[2] S. P. E. Corporation. SPEC power and performance benchmark methodology v2.1, Aug. 2011.

[3] J. Davis, S. Rivoire, M. Goldszmidt, and E. Ardestani. Accounting for variability in large-scale cluster power models. In The Second Exascale Evaluation and Research Techniques Workshop, held in conjunction with ASPLOS 2011, 2011.

[4] V. W. Freeh, D. K. Lowenthal, F. Pan, N. Kappiah, R. Springer, B. Rountree, and M. E. Femal. Analyzing the energy-time trade-off in high-performance computing applications. IEEE Trans. Parallel Distrib. Syst., 18(6):835–848, 2007.

[5] L. Keys, S. Rivoire, and J. D. Davis. The search for energy-efficient building blocks for the data center. In A. L. Varbanescu, A. M. Molnos, and R. van Nieuwpoort, editors, ISCA Workshops, volume 6161 of Lecture Notes in Computer Science, pages 172–182. Springer, 2010.

[6] M. Y. Lim, V. W. Freeh, and D. K. Lowenthal. Adaptive, transparent CPU scaling algorithms leveraging inter-node MPI communication regions. Parallel Computing, 37(10-11):667–683, 2011.

[7] D. Martínez, J. Albín, T. Pena, J. Cabaleiro, F. Rivera, and V. Blanco. Using accurate AIC-based performance models to improve the scheduling of parallel applications. Journal of Supercomputing, 58(3):332–340, 2011. Online: http://dx.doi.org/doi:10.1007/s11227-011-0589-1.

[8] D. Martinez, J. Cabaleiro, T. Pena, F. Rivera, and V. Blanco. Performance modeling of MPI applications using model selection techniques. In M. Danelutto, J. Bourgeais, and T. Gross, editors, 18th Euromicro Conference on Parallel, Distributed and Network-based Processing, PDP2010, pages 95–102, Pisa, Italy, Feb. 2010. IEEE Computer Society.

[9] K. M. Mullen and I. H. M. van Stokkum. nnls: The Lawson-Hanson algorithm for non-negative least squares (NNLS), Apr. 2010.

[10] A. Schuermans. Schleifenbauer Products BV, Mar. 2012.

[11] Various. R: A language and environment for statistical computing and graphics, Mar. 2012.

[12] Z. Xu and K. Hwang. Modeling communication overhead: MPI and MPL performance on the IBM SP2. IEEE Parallel & Distributed Technology: Systems & Applications, 4(1):9–24, Spring 1996.


Proceedings of the 12th International Conference on Computational and Mathematical Methods in Science and Engineering, CMMSE 2012, July 2-5, 2012.

The solution of Block-Toeplitz linear systems of equations in multicore computers

Pedro Alonso1, Daniel Argüelles2, José Ranilla2 and Antonio M. Vidal1

1 Departamento de Sistemas Informáticos y Computación, Universitat Politècnica de València, Spain

2 Departamento de Informática, Universidad de Oviedo, Spain

emails: palonso@dsic.upv.es, daniel.arguelles.martino@gmail.com, ranilla@uniovi.es, avidal@dsic.upv.es

Abstract

There exist algorithms called "fast" which exploit the special structure of Toeplitz matrices and allow, e.g., solving a linear system of equations in O(n²) flops (instead of the O(n³) flops required by classical algorithms). Due to the constantly increasing core count in current computers, it is necessary to parallelize such algorithms in order to get the most out of the underlying hardware. In particular, we propose in this paper an efficient implementation of the Generalized Schur Algorithm, a well-known algorithm for the solution of Toeplitz systems, adapted to work on a block-Toeplitz matrix. Our algorithm is based on matrix-matrix multiplications with the aim of leveraging threaded routines that implement this operation.

Key words: Block-Toeplitz, linear systems, Generalized Schur Algorithm, multicore computers

MSC 2000: 15B05, 65F05, 68W04, 68W10

1 Introduction

Block-Toeplitz matrices appear in many fields of engineering, e.g., in time series analysis, in signal or image processing, and in system identification, often through the solution of a linear system of equations. This paper presents an implementation for the solution of

T x = b,    (1)

PROMETEO/2009/013, Generalitat Valenciana. Projects TEC2009-13741, TIN2010-14971 and TIN2011-15734-E (CAPAP-H4) of the Ministerio de Ciencia e Innovación, Spain.


where the system matrix T ∈ R^{n×n} is a symmetric block-Toeplitz matrix of the form

    T = ( A_0   A_1^T  A_2^T   ⋯  )
        ( A_1   A_0    A_1^T   ⋱  )
        ( A_2   A_1    A_0     ⋱  )    (2)
        (  ⋮     ⋱      ⋱      ⋱  )

where each block A_i ∈ R^{ν×ν}, i = 0, …, N − 1, is full dense, and b, x ∈ R^n are the right-hand-side and solution vectors, respectively. In this paper, the matrix T (2) is assumed to be positive definite. For simplicity in the exposition we consider hereafter n to be an integer multiple of the block size ν, although this restriction can easily be relaxed in the actual algorithms.
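For illustration, a compact representation of such a matrix only needs the blocks A_0, …, A_{N−1}; the following C sketch (our own, with hypothetical names) recovers the block standing at block position (I, J) of T:

    #include <stdbool.h>

    typedef struct {
        int N, nu;          /* N block-rows/columns of nu x nu blocks    */
        const double **A;   /* A[k] points to block A_k, k = 0..N-1      */
    } block_toeplitz;

    /* By symmetry, T(I,J) = A_{I-J} if I >= J and A_{J-I}^T otherwise.
       Returns the stored block and flags whether it must be read
       transposed. Assumes 0 <= I, J < N.                                */
    const double *block_at(const block_toeplitz *T, int I, int J, bool *transp)
    {
        int k = I - J;
        *transp = (k < 0);
        return T->A[k < 0 ? -k : k];
    }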

We address problem (1) by obtaining the Cholesky factor of T, so that T = C^T C with C upper triangular, through a well-known algorithm called the Generalized Schur Algorithm (GSA) [2]. The GSA has intrinsic parallelism, since the elements of a given row of C can be computed concurrently. This fact has already been successfully exploited, e.g., in [1], where an implementation of the GSA for shared memory was proposed. In [3], a version of the GSA based on level-3 operations was proposed. However, the algorithm in [3] is based on complicated operations that are also difficult to implement in parallel.

We propose here a simpler version of the GSA for block-Toeplitz matrices than that of [3]. The algorithm is also based on level-3 operations, in particular on matrix-matrix products. Furthermore, to the extent that threaded routines implementing the matrix product are available, our algorithm can draw on the thread-level parallelism of multiple cores.

The next section describes our implementation of the GSA for block-Toeplitz matrices. Section 3 gives some data about the results that can be obtained with our implementation. At the end we offer some concluding remarks.

2 The Generalized Schur Algorithm for Block-Toeplitz matrices

For the solution of system (1) we propose to perform the Cholesky factorization T = C^T C, where C ∈ R^{n×n} is upper triangular. We use the GSA, described in Algorithm 1, to compute this factor.

Algorithm 1 receives the matrix G, called the generator, which has the form

    G = ( U   Ā_1^T   Ā_2^T   ⋯   Ā_{N−1}^T )
        ( 0   Ā_1^T   Ā_2^T   ⋯   Ā_{N−1}^T )    (3)

where U is the upper triangular Cholesky factor of A_0, such that A_0 = U^T U, and Ā_i^T = U^{−T} A_i^T, for i = 1, …, N − 1.


Algorithm 1 Cholesky decomposition of T = C^T C.
Require: Generator G (3); return the Cholesky factor C.
 1: U = chol(A_0)                            (U^T U = A_0)
 2: Solve U^T G_2 = [ A_1^T  A_2^T  …  A_{N−1}^T ]
 3: C = [ U  G_2 ]                           (first block-row of C)
 4: m = n − ν
 5: G_1 = C(:, 1:m)
 6: for i = 1 → N − 1 do
 7:     [G_1; G_2] ← normalize(G_1, G_2)     (normalization of the generator)
 8:     C ← [ C ; 0_{ν×i·ν}  G_1 ]           (adding a new block-row into C)
 9:     G_1 ← G_1(:, 1:m−ν)                  (updating G_1, shift)
10:     G_2 ← G_2(:, ν+1:m)                  (updating G_2, shift)
11:     m ← m − ν
12: end for

The step shown in line 7, which consists of a call to the function normalize, performs a normalization of the generator G, overwriting it with its proper form. Consider the following partition of the generator:

    G = ( G_1 ) = ( G_11  G_12 )
        ( G_2 )   ( G_21  G_22 )    (4)

where G_11, G_21 ∈ R^{ν×ν} and G_12, G_22 ∈ R^{ν×m}. We say that G is in proper form if G_21 is zero.

Algorithm 2 Routine normalize. Normalization of the generator G.
Require: G_11, G_21, G_12, G_22 of partition (4); return G_11, G_21, G_12, G_22 (G_21 is zero).
1: [Q, R] = qr(G_21), G_21 ← R               (G_21 = QR)
2: H = I ⊕ Q                                 (H = diag(I, Q))
3: for j = 1 → ν do
4:     for i = 1 → j do
5:         hyper(G_11(j, j:ν), G_21(i, j:ν), H(:, j), H(:, i + ν))
6:     end for
7: end for
8: [G_12; G_22] ← H^T · [G_12; G_22]


At any given iteration of the algorithm, the first ν columns of the generator have the following form (illustrated here for ν = 4):

    ( G_11 )     ( x x x x )
    ( G_21 )  =  ( 0 x x x )
                 ( 0 0 x x )
                 ( 0 0 0 x )
                 ( x x x x )
                 ( x x x x )
                 ( x x x x )
                 ( x x x x )

where x just denotes nonzero entries. Step 7 of Algorithm 1 consists of zeroing the ν × ν entries of the lower square (G_21) by calling the function normalize (normalization). This step can be carried out in different ways. In this paper, we perform a QR decomposition of the factor G_21 in place, so that G_21 is replaced by the upper triangular factor R of its QR decomposition (step 1 of Algorithm 2). Then, a succession of hyperbolic transformations nullifies the remaining entries of G_21, i.e., the entries of its upper triangular part (steps 3 to 7). Algorithm 3 computes a hyperbolic transformation and applies it to two pairs of arrays.

The key to Algorithm 2 is that it works only on the factors G_11 and G_21 to compute the transformations needed to set the generator in proper form. Instead of applying these transformations one at a time, a matrix H is built by accumulating them. Consequently, step 8 of Algorithm 2, which involves the largest amount of data, becomes a matrix-matrix product.
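For instance, under the assumption that G_12 and G_22 are stacked contiguously in a column-major 2ν × m array, step 8 maps to a single BLAS call, as in the following sketch (our own illustration; a threaded BLAS such as a multithreaded OpenBLAS or MKL then runs it on multiple cores, which is exactly the parallelism the paper targets):

    #include <cblas.h>
    #include <stdlib.h>
    #include <string.h>

    /* Step 8 of Algorithm 2: [G12; G22] <- H^T [G12; G22], with H of size
       2*nu x 2*nu and the stacked generator G of size 2*nu x m, both
       stored column-major. */
    void apply_accumulated_H(const double *H, double *G, int nu, int m)
    {
        int r = 2 * nu;
        double *tmp = malloc((size_t)r * m * sizeof *tmp);
        cblas_dgemm(CblasColMajor, CblasTrans, CblasNoTrans,
                    r, m, r,               /* (r x r)^T times (r x m) */
                    1.0, H, r, G, r,
                    0.0, tmp, r);
        memcpy(G, tmp, (size_t)r * m * sizeof *tmp);
        free(tmp);
    }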

Algorithm 3 Routine hyper. Computation and updating of arrays with a hyperbolic rotation.
Require: Arrays u, v, x, y; return u, v, x, y.
 1: α = u_1
 2: β = v_1
 3: if α² ≥ β² then
 4:     γ = √(α² − β²)
 5:     α ← α/γ
 6:     β ← β/γ
 7:     w = (αu − βv)
 8:     v ← (αv − βu)
 9:     u = w
10:     w = (αx − βy)
11:     y ← (αy − βx)
12:     x = w
13: end if
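For concreteness, a direct C rendering of Algorithm 3 could look as follows: a sketch under our reading that γ = √(α² − β²) (the radical is lost in the source scan), with separate lengths for the generator-row pair and the H-column pair, since in Algorithm 2 they differ:

    #include <math.h>

    /* Computes a hyperbolic rotation from u[0], v[0] and applies it to
       the pair (u, v) of length n_uv and to the pair (x, y) of length
       n_xy, mirroring steps 1-13 of Algorithm 3. */
    void hyper(double *u, double *v, int n_uv, double *x, double *y, int n_xy)
    {
        double alpha = u[0], beta = v[0];
        if (alpha * alpha >= beta * beta) {
            double gamma = sqrt(alpha * alpha - beta * beta);
            double a = alpha / gamma, b = beta / gamma;
            for (int i = 0; i < n_uv; ++i) {   /* update generator rows */
                double w = a * u[i] - b * v[i];
                v[i] = a * v[i] - b * u[i];
                u[i] = w;
            }
            for (int i = 0; i < n_xy; ++i) {   /* accumulate into H     */
                double w = a * x[i] - b * y[i];
                y[i] = a * y[i] - b * x[i];
                x[i] = w;
            }
        }
    }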
