
Proceedings of the 12th International Conference on Computational and Mathematical Methods in Science and Engineering, CMMSE 2012, July 2-5, 2012.

An extension of the Ikebe algorithm for the inversion of Hessenberg matrices

J. Abderramán Marrero1 and V. Tomeo2

1 Department of Mathematics Applied to Information Technologies, Telecommunication Engineering School, U.P.M. Technical University of Madrid, Spain

2 Department of Algebra, School of Statistics, U.C.M. Complutense University of Madrid,

Spain

emails: jc.abderraman@upm.es, tomeo@estad.ucm.es

Abstract

The Ikebe algorithm for computing the lower half of the inverse of any (unreduced) upper Hessenberg matrix is extended here to compute the entries of the superdiagonal as well. It gives rise to an inversion algorithm based on the factorization H^{-1} = H_L · U^{-1}. The lower Hessenberg matrix H_L is quasiseparable and U^{-1} is upper triangular, with diagonal entries u_{i,i} = 1. The computational complexity, O(n^3), is connected with the back substitution required for the inversion of the matrix U. Moreover, the inverses of quasiseparable Hessenberg matrices are obtained in O(n^2) time. Numerical comparisons with other specialized inversion algorithms are also introduced.

Key words: Computational complexity, Hessenberg matrix, Ikebe algorithm, inverse matrix, matrix factorization

MSC 2000: 15A09, 15A15, 15A23, 65F05, 65Y20

1 An algorithm for computing and factorizing the inverses of unreduced Hessenberg matrices

Hessenberg matrices play a central role in numerical linear algebra, in particular in the eigenvalue problem for a general matrix. Furthermore, the search for fast and simple algorithms for the inverses of such structured matrices is of current interest. The Ikebe algorithm [6] yields the entries of the upper half of the inverse of any (unreduced) lower Hessenberg matrix with O(n^2) complexity. Two algorithms with O(n^3) complexity have been recently

© CMMSE, Page 23 of 1573, ISBN: 978-84-615-5392-1


introduced [3, 4] for computing the inverse matrix and the determinant of (unreduced) lower Hessenberg matrices.

Our aim here is to propose a procedure to compute the factorization H^{-1} = H_L · U^{-1} of the inverse of an unreduced Hessenberg matrix H. The matrix H_L is quasiseparable, i.e. rank(H_L(i+1:n, 1:i)) ≤ 1 and rank(H_L(1:i, i+1:n)) ≤ 1 for i = 1, 2, ..., n − 1; see e.g. [2]. The matrix U is upper triangular, with ones on its main diagonal. We take upper Hessenberg matrices without loss of generality. The lower Hessenberg matrix H_L is obtained directly by a simple extension of the Ikebe algorithm to the superdiagonal entries of H^{-1}.

We recall the Ikebe algorithm, adapted here to an (unreduced) upper Hessenberg matrix H of order n. It provides the lower half of the inverse matrix H^{-1}, i.e. the entries h^{(-1)}_{i,j} with i ≥ j. Following [6], we have

h^{(-1)}_{i,j} = y(i) · x(j);  i ≥ j,  (1)

where y(i) and x(j) are the ith and jth components of the vectors y and x, respectively. The components of the vector x are obtained in the following recursive way, with h^{-1}_{j,j-1} = 1/h_{j,j-1},

x(1) = λ ≠ 0  (an arbitrary constant),

x(j) = −(1/h_{j,j−1}) ∑_{k=1}^{j−1} h_{k,j−1} x(k)  (j = 2, 3, ..., n).  (2)

The entries of the vector y are also given by the following recurrence,

y(n) = ( ∑_{k=1}^{n} h_{k,n} x(k) )^{−1},

y(i) = −(1/h_{i+1,i}) ∑_{k=i+1}^{n} h_{i+1,k} y(k)  (i = n−1, n−2, ..., 1).  (3)
Now we compare equations (2)-(3) with the closed representation for the entries of the inverses of upper Hessenberg matrices; see [1], Corollary 1. Therefore, we can extend the Ikebe algorithm to obtain the entries of the superdiagonal, h^{(-1)}_{i,i+1}, i = 1, ..., n − 1, of H^{-1}.

Proposition 1 The entries of the superdiagonal of the inverse of an (unreduced) upper Hessenberg matrix H can be represented as

h^{(−1)}_{i,i+1} = y(i) · x(i+1) + 1/h_{i+1,i};  1 ≤ i ≤ n − 1,  (4)

where y(i) and x(i + 1), given by (3) and (2) respectively, are obtained from the Ikebe algorithm.


In addition, we can recover from y(n) the value of det H, the determinant of the matrix H, with the convention that the product below equals 1 when it is empty:

det H = (−1)^{n−1} ( ∏_{k=2}^{n} h_{k,k−1} ) / ( λ · y(n) ).
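Proposition 1 and the determinant identity admit a direct numerical check. The sketch below is our own illustrative code (0-based indices; all names are ours): it recomputes the Ikebe vectors inline so the fragment is self-contained, then evaluates both formulas.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Illustrative check of formula (4) and the determinant identity.
// The Ikebe vectors x, y are recomputed inline from (2)-(3).
struct XY { std::vector<double> x, y; };

XY ikebe_vectors(const std::vector<std::vector<double>>& h, double lambda) {
    const int n = (int)h.size();
    std::vector<double> x(n), y(n);
    x[0] = lambda;
    for (int j = 1; j < n; ++j) {
        double s = 0.0;
        for (int k = 0; k < j; ++k) s += h[k][j - 1] * x[k];
        x[j] = -s / h[j][j - 1];
    }
    double s = 0.0;
    for (int k = 0; k < n; ++k) s += h[k][n - 1] * x[k];
    y[n - 1] = 1.0 / s;
    for (int i = n - 2; i >= 0; --i) {
        double t = 0.0;
        for (int k = i + 1; k < n; ++k) t += h[i + 1][k] * y[k];
        y[i] = -t / h[i + 1][i];
    }
    return {x, y};
}

// Formula (4): superdiagonal entry h^{-1}_{i,i+1}.
double superdiag(const XY& v, const std::vector<std::vector<double>>& h, int i) {
    return v.y[i] * v.x[i + 1] + 1.0 / h[i + 1][i];
}

// det H = (-1)^{n-1} (prod_{k=2}^{n} h_{k,k-1}) / (lambda * y(n)).
double det_from_y(const std::vector<std::vector<double>>& h,
                  double lambda, double y_n) {
    const int n = (int)h.size();
    double prod = 1.0;
    for (int k = 1; k < n; ++k) prod *= h[k][k - 1];
    const double sign = ((n - 1) % 2 == 0) ? 1.0 : -1.0;
    return sign * prod / (lambda * y_n);
}
```

For the 5×5 example of Section 2 the superdiagonal entries evaluate to −0.5, matching the inverse printed there.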

Taking into consideration that the resulting matrix H_L contains the lower half plus the superdiagonal of the inverse matrix H^{-1}, and that H·H^{-1} = I_n, we obtain the product H·H_L = U, an upper triangular matrix with ones on its main diagonal. Therefore both matrices H_L and U are nonsingular. This gives rise to our main result, a particular factorization for the inverses of (nonsingular) upper Hessenberg matrices.

Theorem 1 Let H be a nonsingular matrix of order n. Then the following statements are equivalent:

1. H is an upper Hessenberg matrix.

2. The inverse matrix H^{-1} has a factorization of the form H^{-1} = H_L · U^{-1}, where the lower Hessenberg matrix H_L is a quasiseparable matrix, and U^{-1} is an upper triangular matrix with ones on its main diagonal.

An equivalent result can be proposed for (nonsingular) lower Hessenberg matrices.

From these results, since in the unreduced case the matrix H_L is obtained from the expanded Ikebe algorithm, we propose a constructive method for computing and factorizing the inverses of unreduced Hessenberg matrices. Note that its computational complexity, O(n^3) for general nonsingular Hessenberg matrices, comes from the back substitution required to invert the upper triangular matrix U, with u_{i,i} = 1 (1 ≤ i ≤ n).
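The whole procedure, building H_L via the expanded Ikebe recurrences and then inverting U = H·H_L by back substitution, can be sketched end to end. This is a dense O(n^3) illustration under our own naming, not the authors' implementation:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Sketch of the full inversion: build the lower Hessenberg factor H_L
// (lower half plus superdiagonal of H^{-1}, from the expanded Ikebe
// recurrences), form U = H * H_L, then solve X * U = H_L for X = H^{-1}
// by back substitution over the unit upper triangular U.
Matrix invert_hessenberg(const Matrix& h) {
    const int n = (int)h.size();
    std::vector<double> x(n), y(n);
    x[0] = 1.0;                                   // lambda = 1
    for (int j = 1; j < n; ++j) {
        double s = 0.0;
        for (int k = 0; k < j; ++k) s += h[k][j - 1] * x[k];
        x[j] = -s / h[j][j - 1];
    }
    double s = 0.0;
    for (int k = 0; k < n; ++k) s += h[k][n - 1] * x[k];
    y[n - 1] = 1.0 / s;
    for (int i = n - 2; i >= 0; --i) {
        double t = 0.0;
        for (int k = i + 1; k < n; ++k) t += h[i + 1][k] * y[k];
        y[i] = -t / h[i + 1][i];
    }
    Matrix HL(n, std::vector<double>(n, 0.0));    // lower Hessenberg factor
    for (int i = 0; i < n; ++i)
        for (int j = 0; j <= i; ++j) HL[i][j] = y[i] * x[j];
    for (int i = 0; i + 1 < n; ++i)               // superdiagonal, formula (4)
        HL[i][i + 1] = y[i] * x[i + 1] + 1.0 / h[i + 1][i];
    Matrix U(n, std::vector<double>(n, 0.0));     // U = H * HL, unit upper triangular
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            for (int k = 0; k < n; ++k) U[i][j] += h[i][k] * HL[k][j];
    Matrix X(n, std::vector<double>(n, 0.0));     // back substitution: X * U = HL
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < n; ++i) {
            double acc = HL[i][j];
            for (int k = 0; k < j; ++k) acc -= X[i][k] * U[k][j];
            X[i][j] = acc;                        // u_{jj} = 1
        }
    return X;
}
```

For the 5×5 example of the next section this reproduces the printed inverse, with the strictly upper part beyond the superdiagonal coming out of the back substitution.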

2 Numerical examples

To begin the numerical examples, we use the commercial package Matlab on a 1.80 GHz computer. First, we handle a quasiseparable upper Hessenberg matrix of order n = 5, the transpose of a customary example given in [3, 4]. Its entries are h_{ij} = 1 for 1 ≤ i ≤ j ≤ 5, h_{ij} = −1 for i = j + 1, 2 ≤ i ≤ 5, and h_{ij} = 0 otherwise. Our constructive procedure gives the entries of the inverse in O(n^2) time. Nevertheless, the algorithms from [3, 4] provide results in O(n^3) time.

>> H^(-1) =
    0.5000   -0.5000         0         0         0
    0.2500    0.2500   -0.5000         0         0
    0.1250    0.1250    0.2500   -0.5000         0
    0.0625    0.0625    0.1250    0.2500   -0.5000
    0.0625    0.0625    0.1250    0.2500    0.5000


Order    Elouafi - Hadj    Chen - Yu    Ikebe expanded
 15      1.75e-13          1.67e-13     1.68e-14
 35      7.37e-10          1.54e-10     5.34e-14
 55      2.45e-06          1.31e-06     8.65e-14
 75      1.36e-02          5.58e-03     2.57e-13
 95      4.78e+01          2.62e+01     1.49e-13
115      1.02e+05          6.15e+04     2.57e-13
135      3.09e+08          2.29e+08     7.21e-13
155      2.55e+12          1.02e+12     2.03e-12

Table 1: Values of ‖H^{-1}H − I‖ for the accuracy of the three algorithms in the computation of the inverse of a quasiseparable Hessenberg matrix.

To check the accuracy, we compare in Table 1 the values of ‖H^{-1}H − I‖, with I the identity matrix, for the inversion of a Hessenberg matrix with entries h_{i,j} = 1 for i = j + 1, h_{i,j} = 2.5 for i ≤ j, and h_{i,j} = 0 otherwise. Our procedure also gives the results in O(n^2) time. The algorithms given in [3, 4] produce inaccurate results, in O(n^3) time, for the entries of the inverse of this quasiseparable Hessenberg matrix.

When handling more general Hessenberg matrices, the three algorithms compute the inverses in O(n^3) time. However, our procedure yields shorter elapsed times than the algorithms from [3, 4].

References

[1] J. Abderramán Marrero and V. Tomeo, On the closed representation for the inverses of Hessenberg matrices, J. Comput. Appl. Math. 236 (2012) 2962–2970.

[2] R. Bevilacqua, E. Bozzo, and G. M. Del Corso, qd-type methods for quasiseparable matrices, SIAM J. Matrix Anal. Appl. 32 (2011) 722–747.

[3]Y. H. Chen and C. Y. Yu, A new algorithm for computing the inverse and the determinant of a Hessenberg matrix, Appl. Math. Comput. 218 (2011) 4433–4436.

[4]M. Elouafi and A.D. Aiat Hadj, A new recursive algorithm for inverting Hessenberg matrices, Appl. Math. Comput. 214 (2009) 497–499.

[5]G. H. Golub and C. F. Van Loan, Matrix Computations, third ed., Johns Hopkins University Press, Baltimore, Maryland, USA, 1996.

[6]Y. Ikebe, On inverses of Hessenberg matrices, Linear Algebra Appl. 24 (1979) 93–97.


Proceedings of the 12th International Conference on Computational and Mathematical Methods in Science and Engineering, CMMSE 2012, July 2-5, 2012.

Skeletal based programming for Dynamic Programming on GPUs

Alejandro Acosta1 and Francisco Almeida1

1 Dpt. Statistics and Computer Science, La Laguna University, Spain.

emails: aacostad@ull.com, falmeida@ull.com

Abstract

Current parallel systems, composed of mixed multi/manycore systems and GPUs, become more complex due to their heterogeneous nature. The programmability barrier inherent to parallel systems increases almost with each new architecture delivery. The development of libraries, languages and tools that allow an easy and efficient use of this new scenario is mandatory. Among the proposals found to broach this problem, skeletal programming appeared as a natural alternative to ease the programmability of parallel systems in general, and of GPU programming in particular. In this paper we develop a programming skeleton for Dynamic Programming on GPUs. The skeleton, implemented in CUDA, allows the user to execute parallel codes for GPUs just by providing sequential C++ specifications of her problems. The performance and ease of use of this skeleton have been tested on several optimization problems. The experimental results obtained on an Nvidia Fermi prove the advantages of the approach.

Key words: Skeleton, GPU, Dynamic programming.

1 Introduction

Today’s generation of computers is based on an architecture with identical multiple processing units consisting of several cores (multicores). The number of cores per processor is expected to increase every year. It is also well known that the current generation of compilers is incapable of automatically exploiting the ability this architecture affords applications.

The situation is further complicated by the fact that current architectures are heterogeneous by nature, which offers the possibility of combining these multicores with GPU-based systems, for example, in a general purpose architecture. The programmability of these systems, however, poses a barrier that hampers efficient access to their exploitation.


Many proposals have been put forth to facilitate the job of programmers. Leaving aside proposals based on the development of new programming languages due to the effort this represents for the user (effort to learn and to reuse code), the remaining proposals are based on transforming sequential code into parallel code, or on transforming parallel code designed for one architecture into parallel code designed for another.

Skeletal programming for GPUs on domain-specific applications has been provided by several authors: Patus [2] for stencil computations or Delite [1] for machine learning problems, among others. More general approaches have been presented by SkelCL [5] or SkePU [3] to ease the programming of GPU architectures.

In this paper we also propose the use of skeletal based programming to exploit GPUs. An advantage of the paradigm is that the user provides a sequential specification of her problem and the skeleton implements the parallelization of the algorithm to solve it. We instantiate the method on the Dynamic Programming technique. In [4] we proposed the use of DPSKEL skeletons to offset the dearth of general software tools for dynamic programming (sequential and parallel). Our aim was to bridge the obvious gap existing between general methods and DP applications. The goal of DPSKEL is to minimize the user effort required to work with the tool by conforming as much as possible to the use of standard methodologies. In this paper we have expanded the original version of DPSKEL to adapt it to new architectures; on this occasion we developed the solution engine for GPU architectures using CUDA. The proposed implementation shows several advantages: it allows the easy development and fast prototyping of Dynamic Programming problems on GPUs, since it hides the parallel traversing of the Dynamic Programming table and also hides the difficulty of CUDA programming. Another advantage is that the skeleton can be adapted to changes in the architecture and in the programming interface without altering the Dynamic Programming code for the specific problem provided by the user. As a proof of the ease of use of our tool, four combinatorial optimization problems have been instantiated: the 0/1 Knapsack problem, the Resource Allocation problem, the Triangulation of Convex Polygons problem and the Guillotine Cutting Stock problem. Computational results on an Nvidia Fermi C2050 are provided for all test problems, together with a comparative analysis of the performance against shared memory skeletons.

The paper is structured as follows: we present the GPU skeleton developed and its software architecture in Section 2, and Section 3 describes the extensive computational experiment undertaken by applying the method developed. The ease of development and the increase in productivity are substantial. We conclude the paper with Section 4, in which we outline the key findings and propose future lines of research.


2 A GPU Skeleton for Dynamic Programming

As stated in DPSKEL [4], developing software skeletons for DP implies analysing and determining those elements that can be extracted from a specific case and those that depend on the application. Assuming that the user is capable of obtaining the functional equations herself, in DPSKEL the user provides the structure of a state and its evaluation through the functional equations, and the DP table is abstracted as a state table. DPSKEL provides the table and several methods for accessing it during the state evaluation process. These methods allow for different traversings (by rows, by columns, or diagonally), with the user choosing the best one based on the dependencies of the functional equations. In the sequential case, the traversing chosen for the table indicates that the states of the row (column or diagonal, respectively) will be processed sequentially, while in the GPU case, a set of rows (columns or diagonals, respectively) will be assigned to a set of threads to be processed simultaneously. This approach allows us to introduce any of the algorithm parallelization strategies devised for DP.
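As a toy illustration of why a diagonal traversing exposes parallelism (this is our own example, not DPSKEL code): when each state (i, j) depends only on (i−1, j) and (i, j−1), all states on the anti-diagonal i + j = d are mutually independent, so a GPU engine could assign one thread to each. The plain loop below only demonstrates the evaluation order.

```cpp
#include <cassert>
#include <vector>

// Lattice-path counting DP, T[i][j] = T[i-1][j] + T[i][j-1], evaluated
// wavefront by wavefront: every state on a diagonal d = i + j could be
// handled by a separate thread, since its dependencies lie on diagonal d-1.
std::vector<std::vector<long>> count_paths(int n) {
    std::vector<std::vector<long>> T(n, std::vector<long>(n, 0));
    for (int d = 0; d <= 2 * (n - 1); ++d) {      // one pass per wavefront
        for (int i = 0; i < n; ++i) {             // independent states on diagonal d
            int j = d - i;
            if (j < 0 || j >= n) continue;
            if (i == 0 || j == 0) T[i][j] = 1;    // boundary states
            else T[i][j] = T[i-1][j] + T[i][j-1]; // functional equation
        }
    }
    return T;
}
```

The same evaluation order is what a diagonal solution engine imposes; only the inner loop would become concurrent threads.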

DPSKEL adheres to the classes model described in Figure 1. The concepts of State and Decision are abstracted for the user in C++ classes (required). The user describes the problem, the solution and the methods for evaluating a state (the functional equation) and for obtaining the optimal solution. DPSKEL provides the classes (provided) for assigning and evaluating the DP table, making available the methods needed to yield the solution. The implementation details are hidden from the user. The initial versions featured solution engines for managing the sequential and parallel executions on shared and distributed platforms. In this paper we have developed the engines for GPU systems. Each solution engine implements different ways of accessing the DP table. We will now present some basic classes in DPSKEL.

 

 

Figure 1: DPSkel classes model


To develop the proposed GPU skeleton, we have used the CUDA SDK. We have chosen CUDA (instead of OpenCL) since it allows executing C++ code inside a kernel, which enables the use of the class model imposed by DPSKEL. We evaluated the use of OpenCL; however, current versions are restricted to kernels implemented in C. That would impose a new design of the C++ DPSKEL tool and some loss of the generality provided by object oriented programming.

2.1 The State class

The State class holds the information associated with a DP state. This class stores and calculates the optimal value in the state and the decision associated with the optimal evaluation. The evaluation of a state implies accessing information on other states in the DP table. DPSKEL provides an object of the Table class hidden in each instance of the Solver class.

Listing 1: Definition of the State class. Implementation of the Evalua method of the State class for the 0/1 knapsack problem.

void State::Evalua(int stage, int index) {
    Decision dec;
    int val;
    if (index < w[stage]) {
        val = 0;
        dec.setDecision(0);
    }
    else if (stage == 0) {
        val = (p[stage]);
        dec.setDecision(1);
    }
    else {
        val = max(table->getState((stage - 1), index), 0,
                  (table->getState(stage - 1, index - w[stage]) + p[stage]),
                  1, dec);
    }
    setValue(val);
    setDecision(dec);
    if ((stage == sol->getRowSol()) && (index == sol->getColSol()))
        sol->setSolucion(this);
}

The code shown in Listing 1 defines the State class for the knapsack problem. A problem (pbm), a solution (sol), a decision (d) and the DP table (table) are defined. These variables can be regarded as generic variables, since they must be present in any problem to be solved. The value variable stores the optimal profit. We should mention a particular method in this class, the Evalua method, which implements the functional equation. The Evalua function


receives the indices for a state from the DP table. Any of the recurrences in Table 1 can be expressed with this prototype. If the functional equation for a specific problem requires a different prototype, the skeleton is open to method overloading using the polymorphism present in C++.

Listing 2 shows how the attributes __host__ __device__ are added in the header to allow the methods of this class to be executed both on the GPU and in the host system.

Listing 2: Header of the State class.

class State {
    int _value;
    Decision decision;
    Problem *pbm;
    Solution *sol;
    Table *table;
public:
    __host__ __device__ void init(Problem *pbm, Solution *sol, Table *table);
    __host__ __device__ void Evalua(int stage, int index);
    ...
};

2.2 The Table class

The class Table holds the set of States that configure the problem. Each entry in the DP table stores all of the information associated with a state. It holds methods to get (getState(i,j)) and put (putState(i,j)) states from the table. The class Solver takes charge of building the table at the beginning of the execution. Note that to create the table, we used the Unified Virtual Addressing (UVA) provided by CUDA, which unifies the system memory and the GPU memory into a single address space. We experimentally tested the benefits of using UVA for our skeleton: better performance and an important increase in the size of the problems that can be handled. Listing 3 shows the method that allocates the memory for the Dynamic Programming table and initializes the states involved in it.

Listing 3: Definition of the Table class. Initialized method.

void Table::init(const Setup &setup, Problem *pbm, Solution *sol) {
    Num_Stages = setup.getNumStages();
    Num_States = setup.getNumStates();

    cudaMallocHost(&cTABLE, Num_Stages * Num_States * sizeof(State));
    for (int i = 0; i < Num_Stages * Num_States; i++) {
        cTABLE[i].init(pbm, sol, this);
    }
}


2.3 The Solver class

The Solver class provides solution engines for different platforms. This class contains the data structures and methods needed to carry out a DP execution in keeping with the specifications. In practice, this is a virtual class, with the solver classes provided being defined as sub-classes of this main class. To the already known solution algorithms in DPSKEL:

Solver seq. A sequential solver.

Solver OpenMP. A solver for shared-memory systems.

Solver MPI. A solver for distributed-memory systems.

Solver hybrid. A solver for hybrid distributed and shared memory systems.

Solver heterogen. A solver for heterogeneous environments.

in the current design, we added a new solver for the GPU:

Solver cuda. A solver for GPU systems.

When a Solver class object is instantiated, the DP table is created dynamically according to the configuration parameters. Listing 4 shows how the DP table is accessed by the runByRows method of the Solver cuda class. First, we obtain the number of threads needed to solve the problem; in this case one thread per column of the table is generated. Next, the sequential traversing of the table is performed by calling the kernel (kernelRun, see Listing 5). This kernel takes charge of calling the methods, provided by the end user, that evaluate a row. Each one of the threads evaluates a state using the Evalua method supplied by the user. Several kernels implementing different traversing modes have been implemented:

KernelrunByRows. Traverses the DP table by rows, one thread per column.

KernelrunByDiag. Traverses the DP table diagonally, starting from the main diagonal upward.

KernelrunByDiag2. Traverses the DP table diagonally, downward to the secondary diagonal.
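The runByRows scheme can be sketched in plain C++ for the knapsack recurrence of Listing 1 (this stands in for the CUDA kernel; the function name and the two-row storage are our own simplification, not DPSKEL code):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Row-by-row traversal of the knapsack DP table: the outer loop over stages
// is sequential (as the host-side kernel launches are), while every state of
// a row could be evaluated by one GPU thread, since it reads only the
// previous row.  Here the "threads" are emulated by an ordinary inner loop.
std::vector<int> knapsack_rows(const std::vector<int>& w,
                               const std::vector<int>& p, int C) {
    const int n = (int)w.size();
    std::vector<int> prev(C + 1, 0), cur(C + 1, 0);
    for (int stage = 0; stage < n; ++stage) {        // sequential row traversal
        for (int index = 0; index <= C; ++index) {   // one "thread" per column
            if (index < w[stage]) cur[index] = (stage == 0) ? 0 : prev[index];
            else if (stage == 0)  cur[index] = p[stage];
            else cur[index] = std::max(prev[index],
                                       prev[index - w[stage]] + p[stage]);
        }
        prev = cur;
    }
    return prev;   // prev[C] holds the optimal profit
}
```

Within a real kernel launch the inner loop disappears: each thread computes one index of the current row, which is exactly the independence the row traversing exploits.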

As usual with skeletons, the proposed implementation shows several advantages: the parallelism is hidden, allowing the end user to express her problem as sequential code, and the complexity of CUDA programming is hidden as well. The user just implements her problem using sequential C++ and adds the headers that enable the methods to be used by the GPU. Another important advantage is that new changes in the architecture and in the programming interface can be faced by adapting the skeleton, without introducing any change in the code provided by the end user.
