
Fig. 11.5 Combined grid-cloud security architecture

1. A GSI-authenticated request for a new image deployment is received.
2. The security component checks the MyCloud repository for the Clouds for which the user has valid credentials.
3. A new credential is generated for the new instance that needs to be started. If multiple images need to be started, the same instance credential can be reused to reduce the credential generation overhead (about 6–10 s in our experiments, including the communication overhead).
4. The new instance credentials are stored in the MyImage repository, which is only accessible to the enactment engine service for job execution after proper GSI authentication.
5. A start-instance request is sent to the Cloud using the newly generated instance credential (steps 3–5 are sketched in the example below).
6. When an instance is released, the resource manager deletes the corresponding credential from the MyInstance repository.
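The per-instance credential handling of steps 3–5 can be illustrated with a short sketch against an EC2-compatible API, such as the one exposed by Eucalyptus. This is only a minimal illustration under assumed names, not the actual MyCloud/MyInstance implementation: the endpoint URL, repository path, and image ID are hypothetical placeholders.

```python
# Illustrative sketch of steps 3-5: generate a per-instance key pair,
# store the private key in a protected repository, and start the instance
# with the new credential. Endpoint, paths and image ID are hypothetical.
import os
import boto3

def deploy_image(user_id: str, image_id: str, cloud_endpoint: str) -> str:
    ec2 = boto3.client("ec2", endpoint_url=cloud_endpoint)  # EC2-compatible API (e.g. Eucalyptus)

    # Step 3: generate a fresh credential for the new instance
    key_name = f"myinstance-{user_id}-{image_id}"
    key = ec2.create_key_pair(KeyName=key_name)

    # Step 4: store the private key in the instance credential repository
    # (readable only by the GSI-authenticated enactment engine)
    key_path = os.path.join("/srv/myinstance-repo", f"{key_name}.pem")
    with open(key_path, "w") as f:
        f.write(key["KeyMaterial"])
    os.chmod(key_path, 0o600)

    # Step 5: start the instance using the newly generated credential
    result = ec2.run_instances(ImageId=image_id, MinCount=1, MaxCount=1,
                               KeyName=key_name)
    return result["Instances"][0]["InstanceId"]
```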

11.4  Evaluation

We extended the ASKALON enactment engine to support our Cloud extensions by transferring files and submitting jobs to Cloud resources using the SCP/SSH provider of the Java CoG kit [23]. Technical problems with these providers of the CoG kit required us to modify the source code and create a custom build of the library to integrate it seamlessly into the existing system.
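The role of the SCP/SSH provider can be summarized in a few lines. The following is an illustrative Python/paramiko equivalent of what the enactment engine performs through the Java CoG kit (stage the input files, run the activity, fetch the output); the host, user, key file, and paths are placeholders, not the CoG kit API.

```python
# Illustrative sketch of SSH-based staging and job submission to a Cloud
# instance, analogous to the SCP/SSH provider used by the enactment engine.
# Host, user, key file and paths are placeholders.
import paramiko

def run_activity(host: str, key_file: str, local_input: str,
                 remote_dir: str, command: str) -> str:
    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect(host, username="root", key_filename=key_file)

    # Stage the input file to the instance (SCP-like transfer via SFTP)
    sftp = client.open_sftp()
    sftp.put(local_input, f"{remote_dir}/input.dat")

    # Submit the job and wait for it to finish
    stdin, stdout, stderr = client.exec_command(f"cd {remote_dir} && {command}")
    exit_code = stdout.channel.recv_exit_status()

    # Fetch the output produced by the activity
    sftp.get(f"{remote_dir}/output.dat", "output.dat")
    sftp.close()
    client.close()
    if exit_code != 0:
        raise RuntimeError(stderr.read().decode())
    return "output.dat"
```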

For our experiments, we selected a scientific workflow application called Wien2k [24], a program package for performing electronic structure calculations of solids using density functional theory, based on the full-potential (linearized) augmented plane-wave ((L)APW) and local orbital (lo) method. The Wien2k Grid workflow splits the computation into several coarse-grain activities, with the work distribution achieved by two parallel loops (the second and fourth activities) consisting of a large number of independent activities calculated in parallel.

The number of sequential loop iterations is statically unknown. We chose a problem case (called atype) that we solved using 193 and 376 parallel activities, and problem sizes of 7.0, 8.0, and 9.0, which represent the number of plane waves and equal the size of the eigenvalue problem (i.e. the size of the matrix to be diagonalized), referred to as problem complexity in this work.

Figure 11.6 shows on the left the UML representation of the workflow as executed by ASKALON and, on the right, a concrete execution directed acyclic graph (DAG) showing one iteration of the while loop with four parallel activities in each parallel section. The workflow size is only determined at runtime: the parallelism is calculated by the first activity, and the last activity generates the result, which decides whether the main loop is executed again or the result meets the specified convergence criteria.
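The loop structure just described can be captured in a short structural sketch. This is only an illustration of the control flow, assuming placeholder activity functions rather than the real Wien2k kernels.

```python
# Structural sketch of the Wien2k workflow: the first activity determines the
# degree of parallelism, two parallel-for sections (pforLAPW1/pforLAPW2) run
# independent activities, and the last activity decides whether the main loop
# has converged. Activity bodies are placeholders.
from concurrent.futures import ThreadPoolExecutor

def wien2k_workflow(case, run_first, run_second, run_third, run_fourth, run_last):
    converged = False
    while not converged:
        # "first": also yields the number of parallel activities at runtime
        parallelism, data = run_first(case)

        with ThreadPoolExecutor(max_workers=parallelism) as pool:
            # pforLAPW1: independent "second" activities
            data = list(pool.map(run_second, [data] * parallelism))

        data = run_third(data)

        with ThreadPoolExecutor(max_workers=parallelism) as pool:
            # pforLAPW2: independent "fourth" activities
            data = list(pool.map(run_fourth, [data] * parallelism))

        # "last": produces the result and the convergence decision
        converged, case = run_last(data)
    return case
```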

We executed the workflow on a distributed testbed summarized in Table 11.3, consisting of four heterogeneous Austrian Grid sites [25] and 12 virtual CPUs from an "academic Cloud" called dps.cloud, built using the Eucalyptus middleware [6] and the XEN virtualization mechanism [7]. We configured the dps.cloud resource classes to use one core, as multi-core configurations were prevented by a bug in the Eucalyptus software (planned to be fixed in the next release).

 

 

Fig. 11.6 The Wien2k workflow in UML (left) and DAG (right) representation


Table 11.3 Overview of resources used from the grid and the private cloud for workflow execution

Grid site    Location   Cores used  CPU type  GHz  Mem/core
karwendel    Innsbruck  12          Opteron   2.4  1,024 MB
altix1.uibk  Innsbruck  12          Itanium   1.4  1,024 MB
altix1.jku   Linz       12          Itanium   1.4  1,024 MB
hydra.gup    Linz       12          Itanium   1.6  1,024 MB
dps.cloud    Innsbruck  12          Opteron   2.2  1,024 MB

Table 11.4 Wien2k execution time and cost analysis on the Austrian grid with and without cloud resources for different numbers of parallel activities and problem sizes

Parallel    Problem       Grid       Grid + cloud  Speedup      Used instances   Paid instances   $/min
activities  complexity    exec. (s)  exec. (s)     using Cloud  Hours    $       Hours    $       saved
193         Small (7.0)     874.66     803.66      1.09          2.7     0.54    12       2.04    1.72
193         Medium (8.0)  1,915.41   1,218.09      1.57          4.1     0.82    12       2.04    0.18
193         Big (9.0)     3,670.18   2,193.79      1.67          7.3     1.46    12       2.04    0.08
376         Small (7.0)   1,458.92   1,275.31      1.14          4.3     0.86    12       2.04    0.67
376         Medium (8.0)  2,687.85   2,020.17      1.33          6.7     1.34    12       2.04    0.18
376         Big (9.0)     5,599.67   4,228.90      1.32         14.1     2.81    24       4.08    0.17

We fixed the machine size of each Grid site to 12 cores to eliminate the variability in resource availability and make the results across different experiments comparable.

We used a just-in-time scheduling mechanism that tries to map each activity onto the fastest available Grid resource. Once the Grid becomes full (because the size of the workflow parallel loops is larger than the total number of cores in the testbed), the scheduler starts requesting additional Cloud resources for executing the remaining workflow activities in parallel. Once these additional resources are available, they are used like Grid resources, only with different job submission methods.
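A minimal sketch of this just-in-time policy is given below; the resource descriptions, the provisioning call, and the data structures are simplified assumptions for illustration, not the actual ASKALON scheduler.

```python
# Simplified sketch of the just-in-time scheduling policy: map each ready
# activity to the fastest idle Grid core and fall back to freshly provisioned
# Cloud instances once the Grid is full. Data structures are simplified.
def schedule(ready_activities, grid_cores, cloud, submit):
    # grid_cores: list of {"speed": GHz, "busy": bool} entries describing Grid cores
    for activity in ready_activities:
        free = [c for c in grid_cores if not c["busy"]]
        if free:
            # Pick the fastest currently idle Grid core
            core = max(free, key=lambda c: c["speed"])
            core["busy"] = True
            submit(activity, core)
        else:
            # Grid is full: request an additional Cloud instance and use it
            # like a Grid resource, just with a different submission method
            instance = cloud.request_instance()
            submit(activity, instance)
```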

Our goal was to compare the workflow execution for different problem sizes on the four Grid sites, with the execution using the same Grid environment supplemented by additional Cloud resources from dps.cloud. We executed each workflow instance five times and reported the average values obtained. The runtime variability in the Austrian Grid was less than 5%, because the testbed was idle during our experiments and each CPU was dedicated to running its activity with no external load or other queuing overheads.

Table 11.4 shows the workflow execution times for 376 and 193 parallel activities in six different configurations. The small, medium, and big configuration values represent a problem-size parameter that influences the execution time of the parallel activities. The improvement from using Cloud resources, compared with using only the four Grid sites, increases from a small 1.08 speedup for short workflows with a 14-min execution time to a good 1.67 speedup for large workflows with a 93-min execution time. The results show that a small and rather short workflow does not benefit much from the Cloud resources, because the provisioning and data transfer overheads are high relative to the small amount of computation. The main bottleneck when using Cloud resources is that the provisioned single-core instances use separate file systems, which require separate file transfers before the computation can start. In contrast, Grid sites are usually parallel machines that share one file system across a larger number of cores, which significantly decreases the data transfer overheads. Nevertheless, for large problem sizes, the Cloud resources can help to significantly shorten the workflow completion time in case Grids become overloaded.
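As a sanity check, the figures of the first row of Table 11.4 (193 activities, small problem size) can be reproduced with a few lines; the $0.17 per instance-hour rate is the dps.cloud cost estimate from Table 11.6, and the last column of Table 11.4 is consistent with the Cloud cost divided by the number of minutes of execution time saved.

```python
# Reproducing one row of Table 11.4 (193 activities, small problem size).
grid_time = 874.66          # s, Grid-only execution
hybrid_time = 803.66        # s, Grid + Cloud execution
paid_instance_hours = 12    # instance hours billed (whole hours)
rate = 0.17                 # $ per instance hour (dps.cloud estimate, Table 11.6)

speedup = grid_time / hybrid_time                 # ~1.09
cost = paid_instance_hours * rate                 # 12 * 0.17 = $2.04
saved_minutes = (grid_time - hybrid_time) / 60.0  # ~1.2 min saved
dollars_per_minute_saved = cost / saved_minutes   # ~1.72 $/min

print(round(speedup, 2), round(cost, 2), round(dollars_per_minute_saved, 2))
```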

Table 11.5 gives further details on the file transfer overheads and the distribution of activity instances between the pure Grid and the combined Grid-Cloud execution. The file transfer overhead can be reduced by increasing the size of a resource class (i.e. number of cores underneath one instance, which share a file system and the input files for execution), which may result in a lower resource allocation efficiency as the resource allocation granularity increases. We plan to investigate this tradeoff in future work.

To understand and quantify the benefits and potential costs of using commercial Clouds for similar experiments (without re-running the Wien2k workflows, for cost reasons), we executed the LINPACK benchmark [26], which measures the sustained GFlops performance of the resource classes offered by three Cloud providers: Amazon EC2, GoGrid (GG), and our academic dps.cloud (see Table 11.1). We configured LINPACK to use the GotoBLAS linear algebra library (one of the fastest implementations on Opteron processors in our experience) and MPI Chameleon [27] for instances with multiple cores. Table 11.6 summarizes the results, which show the m1.large EC2 instance to be the closest to dps.cloud, assuming that the two cores are used separately, which indicates an approximate realistic cost of $0.20 per core hour. The best sustained performance is offered by GG; however, it has extremely large resource provisioning latencies (see Table 11.6).

Table 11.5 Grid versus cloud file transfer and activity instance distribution to grid and cloud resources

Parallel    File transfers                     Activities run
activities  Total    To grid    To cloud       Total    On cloud
376         2,013    1,544      469 (23%)      759      209 (28%)
193         1,127      778      349 (31%)      389      107 (28%)

Table 11.6 Average LINPACK sustained performance and resource provisioning latency results of various resource classes (see Table 11.1)

Instance               dps.cloud  m1.small  m1.large  m1.xl   c1.medium  c1.xl   GG.1gig  GG.4gig
Linpack (GFlops)       4.40       1.96      7.15      11.38   3.91       51.58   8.81     28.14
Number of cores        1          1         2         4       2          8       1        3
GFlops per core        4.40       1.96      3.58      2.845   1.955      6.44    8.81     9.38
Speedup to dps         1          0.45      1.63      2.58    0.88       11.72   2.00     6.40
Cost [$ per hour]      0 (0.17)   0.085     0.34      0.68    0.17       0.68    0.18     0.72
Provisioning time [s]  312        83        92        65      66         66      558      1,878
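The derived rows of Table 11.6 follow directly from the measured LINPACK performance and the core counts of each resource class; the short sketch below (values copied from the table) reproduces the "GFlops per core" and "Speedup to dps" rows.

```python
# Reproducing the derived rows of Table 11.6 from the measured values.
instances = {
    # name: (LINPACK GFlops, number of cores)
    "dps.cloud": (4.40, 1), "m1.small": (1.96, 1), "m1.large": (7.15, 2),
    "m1.xl": (11.38, 4), "c1.medium": (3.91, 2), "c1.xl": (51.58, 8),
    "GG.1gig": (8.81, 1), "GG.4gig": (28.14, 3),
}
dps_gflops = instances["dps.cloud"][0]

for name, (gflops, cores) in instances.items():
    per_core = gflops / cores          # "GFlops per core" row
    speedup = gflops / dps_gflops      # "Speedup to dps" row (whole instance)
    print(f"{name:10s} {per_core:5.2f} GFlops/core, {speedup:5.2f}x dps.cloud")
```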

 

 

 

 

 

 

 

 
