
APPENDIX A

FLAME algorithms for the BLAS-3 routines

The following algorithms correspond to the algorithmic variants of the routines SYMM (Figure A.1), SYR2K (Figure A.2), TRMM (Figure A.3), and TRSM (Figure A.4) developed and evaluated in Chapter 3. Algorithms are described using the FLAME notation.
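All of the variants below traverse their operands with the same Partition / Repartition / Continue-with pattern; in code, that pattern collapses to an ordinary blocked loop. As a purely illustrative sketch (not taken from the dissertation's sources), the C skeleton below shows this traversal; the routine name, the block-size parameter nb, and the dimension m are placeholders.

    /* Skeleton of the FLAME Partition/Repartition/Continue-with
     * traversal as a blocked loop: at step k the "current" blocks
     * start at index k and span b rows (or columns).             */
    static void blocked_traversal(int m, int nb)
    {
        for (int k = 0; k < m; k += nb) {
            int b = (m - k < nb) ? m - k : nb;  /* "Determine block size b" */
            (void)b;  /* the GEMM/SYMM/TRMM/TRSM updates of each variant go here */
        }
    }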


Algorithm: SYMM PP PM(C, A, B)

Partition
    C → ( CT ),   B → ( BT ),   A → ( ATL | ATR )
        ( CB )        ( BB )        ( ABL | ABR )
  where CT, BT have 0 rows, ATL is 0 × 0
while m(CT) < m(C) do
  Determine block size b
  Repartition
    ( CT )     ( C0 )        ( BT )     ( B0 )
    (    )  →  ( C1 ),       (    )  →  ( B1 ),
    ( CB )     ( C2 )        ( BB )     ( B2 )

    ( ATL | ATR )     ( A00 | A01 | A02 )
    (-----------)  →  ( A10 | A11 | A12 )
    ( ABL | ABR )     ( A20 | A21 | A22 )
    where C1, B1 have b rows, A11 is b × b

  Symm_pp:
      C0 := C0 + A10^T B1      (GEMM)
      C1 := C1 + A11 B1        (SYMM)
      C2 := C2 + A21 B1        (GEMM)

  Symm_pm:
      C1 := C1 + A10 B0        (GEMM)
      C1 := C1 + A11 B1        (SYMM)
      C1 := C1 + A21^T B2      (GEMM)

  Continue with
    ( CT )     ( C0 )        ( BT )     ( B0 )
    (    )  ←  ( C1 ),       (    )  ←  ( B1 ),
    ( CB )     ( C2 )        ( BB )     ( B2 )

    ( ATL | ATR )     ( A00 | A01 | A02 )
    (-----------)  ←  ( A10 | A11 | A12 )
    ( ABL | ABR )     ( A20 | A21 | A22 )
endwhile

Algorithm: SYMM MP(C, A, B)

Partition
    C → ( CL | CR ),   B → ( BL | BR )
  where CL, BL have 0 columns
while n(CL) < n(C) do
  Determine block size b
  Repartition
    ( CL | CR ) → ( C0 | C1 | C2 ),   ( BL | BR ) → ( B0 | B1 | B2 )
    where C1, B1 have b columns

  C1 := C1 + A B1      (SYMM)

  Continue with
    ( CL | CR ) ← ( C0 | C1 | C2 ),   ( BL | BR ) ← ( B0 | B1 | B2 )
endwhile

Figure A.1: Algorithms for SYMM.
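To make the notation concrete, here is an illustrative CBLAS realization of the Symm_pp updates of Figure A.1. It is a sketch under stated assumptions (column-major storage, A symmetric with its lower triangle stored, a caller-chosen block size nb), not the dissertation's actual implementation, and every identifier is chosen for exposition.

    #include <cblas.h>

    /* Sketch of the SYMM_PP variant of Figure A.1 for C := C + A*B,
     * with A an m x m symmetric matrix stored in its lower triangle,
     * B and C m x n, all column-major.                              */
    void symm_pp(int m, int n, int nb,
                 const double *A, int lda,
                 const double *B, int ldb,
                 double *C, int ldc)
    {
        for (int k = 0; k < m; k += nb) {
            int b = (m - k < nb) ? m - k : nb;          /* current block size */
            const double *A10 = &A[k];                  /* b x k              */
            const double *A11 = &A[k + k * lda];        /* b x b              */
            const double *A21 = &A[(k + b) + k * lda];  /* (m-k-b) x b        */
            const double *B1  = &B[k];                  /* b x n              */

            /* C0 := C0 + A10^T * B1   (GEMM) */
            if (k > 0)
                cblas_dgemm(CblasColMajor, CblasTrans, CblasNoTrans,
                            k, n, b, 1.0, A10, lda, B1, ldb, 1.0, C, ldc);
            /* C1 := C1 + A11 * B1     (SYMM) */
            cblas_dsymm(CblasColMajor, CblasLeft, CblasLower,
                        b, n, 1.0, A11, lda, B1, ldb, 1.0, &C[k], ldc);
            /* C2 := C2 + A21 * B1     (GEMM) */
            if (m - k - b > 0)
                cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                            m - k - b, n, b, 1.0, A21, lda, B1, ldb,
                            1.0, &C[k + b], ldc);
        }
    }

The Symm_pm updates would instead combine A10, A11, and A21^T with the B0, B1, and B2 panels, and Symm_mp replaces the whole loop body with a single SYMM on a column panel of B and C.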


Algorithm: SYR2K MP PM(A, B, C)

Partition
    A → ( AT ),   B → ( BT ),   C → ( CTL | CTR )
        ( AB )        ( BB )        ( CBL | CBR )
  where AT, BT have 0 rows, CTL is 0 × 0
while m(AT) < m(A) do
  Determine block size b
  Repartition
    ( AT )     ( A0 )        ( BT )     ( B0 )
    (    )  →  ( A1 ),       (    )  →  ( B1 ),
    ( AB )     ( A2 )        ( BB )     ( B2 )

    ( CTL | CTR )     ( C00 | C01 | C02 )
    (-----------)  →  ( C10 | C11 | C12 )
    ( CBL | CBR )     ( C20 | C21 | C22 )
    where A1, B1 have b rows, C11 is b × b

  Syr2k_mp:
      C01 := C01 + A0 B1^T               (GEMM)
      C01 := C01 + B0 A1^T               (GEMM)
      C11 := C11 + A1 B1^T + B1 A1^T     (SYR2K)

  Syr2k_pm:
      C12 := C12 + A1 B2^T               (GEMM)
      C12 := C12 + B1 A2^T               (GEMM)
      C11 := C11 + A1 B1^T + B1 A1^T     (SYR2K)

  Continue with
    ( AT )     ( A0 )        ( BT )     ( B0 )
    (    )  ←  ( A1 ),       (    )  ←  ( B1 ),
    ( AB )     ( A2 )        ( BB )     ( B2 )

    ( CTL | CTR )     ( C00 | C01 | C02 )
    (-----------)  ←  ( C10 | C11 | C12 )
    ( CBL | CBR )     ( C20 | C21 | C22 )
endwhile

Algorithm: SYR2K PP(A, B)

Partition
    A → ( AL | AR ),   B → ( BL | BR )
  where AL, BL have 0 columns
while n(AL) < n(A) do
  Determine block size b
  Repartition
    ( AL | AR ) → ( A0 | A1 | A2 ),   ( BL | BR ) → ( B0 | B1 | B2 )
    where A1, B1 have b columns

  C := C + A1 B1^T + B1 A1^T      (SYR2K)

  Continue with
    ( AL | AR ) ← ( A0 | A1 | A2 ),   ( BL | BR ) ← ( B0 | B1 | B2 )
endwhile

Figure A.2: Algorithms for SYR2K.
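As an illustrative counterpart (a sketch, not the dissertation's code), the SYR2K PP variant reduces to a loop of rank-2b updates. The version below assumes column-major storage with the lower triangle of C stored, and nb is a placeholder block size.

    #include <cblas.h>

    /* Sketch of the SYR2K_PP variant of Figure A.2:
     * C := C + A*B^T + B*A^T with C an n x n symmetric matrix
     * (lower triangle stored), A and B n x k, column-major.
     * Each iteration applies one rank-2b SYR2K update.        */
    void syr2k_pp(int n, int k, int nb,
                  const double *A, int lda,
                  const double *B, int ldb,
                  double *C, int ldc)
    {
        for (int j = 0; j < k; j += nb) {
            int b = (k - j < nb) ? k - j : nb;   /* columns in A1, B1 */
            const double *A1 = &A[j * lda];      /* n x b panel of A  */
            const double *B1 = &B[j * ldb];      /* n x b panel of B  */

            /* C := C + A1*B1^T + B1*A1^T   (SYR2K) */
            cblas_dsyr2k(CblasColMajor, CblasLower, CblasNoTrans,
                         n, b, 1.0, A1, lda, B1, ldb, 1.0, C, ldc);
        }
    }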


Algorithm: TRMM PP PM(A, B)

Partition
    A → ( ATL | ATR ),   B → ( BT )
        ( ABL | ABR )        ( BB )
  where ATL is 0 × 0, BT has 0 rows
while m(ATL) < m(A) do
  Determine block size b
  Repartition
    ( ATL | ATR )     ( A00 | A01 | A02 )        ( BT )     ( B0 )
    (-----------)  →  ( A10 | A11 | A12 ),       (    )  →  ( B1 )
    ( ABL | ABR )     ( A20 | A21 | A22 )        ( BB )     ( B2 )
    where A11 is b × b, B1 has b rows

  Trmm_pp:
      B0 := B0 + A01 B1      (GEMM)
      B1 := A11 B1           (TRMM)

  Trmm_pm:
      B1 := A11 B1           (TRMM)
      B1 := B1 + A12 B2      (GEMM)

  Continue with
    ( ATL | ATR )     ( A00 | A01 | A02 )        ( BT )     ( B0 )
    (-----------)  ←  ( A10 | A11 | A12 ),       (    )  ←  ( B1 )
    ( ABL | ABR )     ( A20 | A21 | A22 )        ( BB )     ( B2 )
endwhile

Algorithm: TRMM MP(A, B)

Partition
    B → ( BL | BR )
  where BL has 0 columns
while n(BL) < n(B) do
  Determine block size b
  Repartition
    ( BL | BR ) → ( B0 | B1 | B2 )
    where B1 has b columns

  B1 := A B1      (TRMM)

  Continue with
    ( BL | BR ) ← ( B0 | B1 | B2 )
endwhile

Figure A.3: Algorithms for TRMM.
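The following sketch (illustrative only; identifiers and the block size nb are assumptions, not the dissertation's code) realizes the Trmm_pm updates of Figure A.3 with CBLAS, for B := A B with A upper triangular (consistent with the A01/A12 blocks in the updates) and column-major storage. B1 is overwritten before the trailing part B2 is touched, so the traversal runs top to bottom.

    #include <cblas.h>

    /* Sketch of the TRMM_PM variant of Figure A.3: B := A*B with A an
     * m x m upper triangular matrix and B m x n, column-major.        */
    void trmm_pm(int m, int n, int nb,
                 const double *A, int lda,
                 double *B, int ldb)
    {
        for (int k = 0; k < m; k += nb) {
            int b = (m - k < nb) ? m - k : nb;
            const double *A11 = &A[k + k * lda];        /* b x b, upper */
            const double *A12 = &A[k + (k + b) * lda];  /* b x (m-k-b)  */
            double *B1 = &B[k];
            double *B2 = &B[k + b];

            /* B1 := A11 * B1          (TRMM) */
            cblas_dtrmm(CblasColMajor, CblasLeft, CblasUpper,
                        CblasNoTrans, CblasNonUnit,
                        b, n, 1.0, A11, lda, B1, ldb);
            /* B1 := B1 + A12 * B2     (GEMM) */
            if (m - k - b > 0)
                cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                            b, n, m - k - b, 1.0, A12, lda, B2, ldb,
                            1.0, B1, ldb);
        }
    }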


Algorithm: TRSM PP PM(A, B)

Partition
    A → ( ATL | ATR ),   B → ( BL | BR )
        ( ABL | ABR )
  where ATL is 0 × 0, BL has 0 columns
while m(ATL) < m(A) do
  Determine block size b
  Repartition
    ( ATL | ATR )     ( A00 | A01 | A02 )
    (-----------)  →  ( A10 | A11 | A12 ),      ( BL | BR ) → ( B0 | B1 | B2 )
    ( ABL | ABR )     ( A20 | A21 | A22 )
    where A11 is b × b, B1 has b columns

  Trsm_pp:
      B1 := B1 − B0 A10^T      (GEMM)
      B1 := B1 A11^{-T}        (TRSM)

  Trsm_pm:
      B1 := B1 A11^{-T}        (TRSM)
      B2 := B2 − B1 A21^T      (GEMM)

  Continue with
    ( ATL | ATR )     ( A00 | A01 | A02 )
    (-----------)  ←  ( A10 | A11 | A12 ),      ( BL | BR ) ← ( B0 | B1 | B2 )
    ( ABL | ABR )     ( A20 | A21 | A22 )
endwhile

Algorithm: TRSM MP(A, B)

Partition
    B → ( BT )
        ( BB )
  where BT has 0 rows
while m(BT) < m(B) do
  Determine block size b
  Repartition
    ( BT )     ( B0 )
    (    )  →  ( B1 )
    ( BB )     ( B2 )
    where B1 has b rows

  B1 := B1 A^{-T}      (TRSM)

  Continue with
    ( BT )     ( B0 )
    (    )  ←  ( B1 )
    ( BB )     ( B2 )
endwhile

Figure A.4: Algorithms for TRSM.
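Finally, an illustrative CBLAS sketch of the Trsm_pm updates of Figure A.4, assuming the solve B := B A^{-T} with A lower triangular (consistent with the A10/A21 blocks in the updates) and column-major storage; the routine name and nb are placeholders rather than the dissertation's code.

    #include <cblas.h>

    /* Sketch of the TRSM_PM variant of Figure A.4: B := B * A^{-T}
     * with A an n x n lower triangular matrix and B m x n, column-major.
     * Once a column block B1 is solved, its contribution is removed from
     * the trailing blocks B2 with a GEMM.                               */
    void trsm_pm(int m, int n, int nb,
                 const double *A, int lda,
                 double *B, int ldb)
    {
        for (int k = 0; k < n; k += nb) {
            int b = (n - k < nb) ? n - k : nb;
            const double *A11 = &A[k + k * lda];        /* b x b, lower */
            const double *A21 = &A[(k + b) + k * lda];  /* (n-k-b) x b  */
            double *B1 = &B[k * ldb];                   /* m x b        */
            double *B2 = &B[(k + b) * ldb];             /* m x (n-k-b)  */

            /* B1 := B1 * A11^{-T}       (TRSM) */
            cblas_dtrsm(CblasColMajor, CblasRight, CblasLower,
                        CblasTrans, CblasNonUnit,
                        m, b, 1.0, A11, lda, B1, ldb);
            /* B2 := B2 - B1 * A21^T     (GEMM) */
            if (n - k - b > 0)
                cblas_dgemm(CblasColMajor, CblasNoTrans, CblasTrans,
                            m, n - k - b, b, -1.0, B1, ldb, A21, lda,
                            1.0, B2, ldb);
        }
    }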


