

In preparation for this, the EC2 extension is capable of launching a proxy (with an SSH tunnel back to the root) within a cloud; agents then collect work (and otherwise communicate with the root) via the Nimrod proxy.
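The idea can be pictured with a simple port-forwarding sketch: the cloud-side proxy host holds an SSH tunnel back to the root, and agents inside the cloud connect to the proxy's local endpoint instead of contacting the root directly. The host name, port and account below are hypothetical, and this is an illustration of the general mechanism rather than Nimrod's actual proxy implementation.

```python
import subprocess

# Hypothetical illustration only: forward a local port on the cloud-side proxy
# host to the Nimrod root over SSH. Agents inside the cloud are then pointed at
# proxy_host:LOCAL_PORT instead of the root itself.
ROOT_HOST = "nimrod-root.example.org"   # assumed name, not from the chapter
ROOT_PORT = 9000                         # assumed port
LOCAL_PORT = 9000

tunnel = subprocess.Popen([
    "ssh", "-N",                                   # no remote command, tunnel only
    "-L", f"{LOCAL_PORT}:localhost:{ROOT_PORT}",   # local forward: proxy -> root
    f"nimrod@{ROOT_HOST}",
])
# ... the proxy now relays agent traffic through localhost:LOCAL_PORT ...
# tunnel.terminate() once the experiment has finished
```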

13.3.3  Additions to the Schedulers

The Nimrod actuator acts on commands scheduled to it by the agent scheduler, which is responsible for examining pending jobs and scheduling agents to consume them. The agent scheduler determines the initialisation necessary for a computational resource to run Nimrod agents, when new agents are needed, and when agents should be stopped.

Previously, the default agent scheduler in Nimrod scheduled agents one at a time, in a similar fashion to how they could be launched using external middleware. However, there are now computational middleware standards (e.g. DRMAA [24]) and non-standard interfaces (e.g. to commercial cloud systems such as EC2) in common use that make it possible to request multiple slots or leases at once, and in some cases this can improve provisioning performance. This necessitated changes to the agent scheduler to enable it to queue multiple agents in a single transaction.
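With a DRMAA implementation, for example, the change amounts to replacing a loop of single submissions with one bulk (array) submission. The sketch below uses the DRMAA Python bindings to show the idea; the agent path and its arguments are placeholders, not Nimrod's actual interface.

```python
import drmaa

# Minimal sketch of requesting many slots in one transaction via DRMAA.
# "/opt/nimrod/agent" and its argument are placeholders, not Nimrod's real paths.
with drmaa.Session() as session:
    jt = session.createJobTemplate()
    jt.remoteCommand = "/opt/nimrod/agent"
    jt.args = ["--root", "nimrod-root.example.org:9000"]  # assumed interface
    # One bulk submission queues agents 1..32 in a single call,
    # instead of 32 separate runJob() transactions.
    job_ids = session.runBulkJobs(jt, 1, 32, 1)
    session.deleteJobTemplate(jt)

print(f"queued {len(job_ids)} agents")
```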

Previously, the job scheduler also had no notion of the dedicated resource capacity presented by machine instances in the cloud. To ensure that each machine is fully utilised (e.g. to avoid running a single uniprocessor job on a multi-core instance), the job scheduler was altered to allocate jobs, where possible, in multiples of the slots per instance.
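The allocation rule itself is simple; the following is a simplified sketch of the idea rather than the Nimrod job scheduler itself.

```python
def jobs_to_allocate(pending_jobs: int, idle_instances: int, slots_per_instance: int) -> int:
    """Illustrative rule: allocate jobs in whole-instance multiples where possible.

    If enough jobs are pending, fill every idle instance completely; otherwise
    allocate whatever remains (a partially filled instance is unavoidable at
    the tail of an experiment).
    """
    capacity = idle_instances * slots_per_instance
    if pending_jobs >= capacity:
        return capacity
    # round down to a whole number of instances, but never to zero
    full = (pending_jobs // slots_per_instance) * slots_per_instance
    return full if full > 0 else pending_jobs

# e.g. 100 pending jobs and 3 idle 8-core instances -> allocate 24 jobs now
print(jobs_to_allocate(100, 3, 8))   # 24
print(jobs_to_allocate(5, 3, 8))     # 5 (tail of the experiment)
```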

13.4  A Case Study in High-Throughput Science and Economic Scheduling

In this section, we present a typical Nimrod experiment, along with domain background, as a case study to demonstrate the utility of the Nimrod EC2 extension. We discuss how the Nimrod EC2 extension might be used to improve scientific results and/or meet a deadline. Further, we give economic scheduling and execution data from a scaled version of the original experiment, and provide a cost analysis of the full version.

The research discussed in this case study uses Bayesian statistics for training a predictive model within a recommender system for the museum domain (Section 13.4.1). Importantly, the computational technique being used – a Markov chain Monte Carlo (MCMC) approach – is common to other fields where multi-dimensional integration arises (e.g. computational physics, computational biology and computational linguistics). Hence, the discussed example applies to a broad range of computational problems, and demonstrates what can be achieved in other domains with a similar structure.


13.4.1  Introduction and Background

This case study concerns techniques for automatically recommending exhibits to museum visitors, based on non-intrusive observations of their movements in the physical space. In general, recommender systems help users find personally interesting items in situations where the amount of available information is excessive [25]. Employing recommender systems in our scenario is challenging, as predictions differ from recommendations (we do not want to recommend exhibits that visitors are going to see anyway). We address this challenge by (1) using a Gaussian Spatial Process Model (SPM) to predict a visitor’s interests in exhibits [26], (2) calculating a prediction of a visitor’s pathway through the museum [27] and (3) combining these models to recommend personally interesting exhibits that may be overlooked if the predicted pathway is followed.

13.4.2  Computational Requirements

SPM has 2n + 3 model parameters (n is the number of exhibits), which need to be estimated from the observed visit trajectories. To achieve this, we opted for a Bayesian solution. Unfortunately, the integrations required to calculate the parameters’ posterior distribution are generally not tractable in closed form. However, the posterior can be approximated numerically using computationally demanding MCMC integration methods, such as the Metropolis-Hastings algorithm and the Gibbs sampler. Following Banerjee et al. [28], we use a slice Gibbs sampler [29] to sample from the posterior distribution. This approach is favourable, because it does not require tuning that is tailored to the application (hence, providing an automatic MCMC algorithm for fitting Gaussian spatial process models). In our case, we used every twentieth generated MCMC sample (to reduce positive autocorrelation between samples) after a burn-in phase of 1,000 iterations, and stopped the sampling procedure after 8,000 iterations. Thus, in total, this procedure provided 350 samples from the posterior.
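The bookkeeping behind the 350 retained samples is simply burn-in followed by thinning, as the sketch below illustrates (a toy random-walk update stands in for the slice Gibbs update of the SPM parameters, which is not reproduced here).

```python
import random

BURN_IN = 1000      # discard the first 1,000 iterations
TOTAL_ITERS = 8000  # stop the sampler after 8,000 iterations
THIN = 20           # keep every 20th sample to reduce positive autocorrelation

def toy_mcmc_step(state: float) -> float:
    """Stand-in for one slice-Gibbs update of the SPM parameters."""
    return state + random.gauss(0.0, 1.0)

state, kept = 0.0, []
for i in range(1, TOTAL_ITERS + 1):
    state = toy_mcmc_step(state)
    if i > BURN_IN and (i - BURN_IN) % THIN == 0:
        kept.append(state)

print(len(kept))  # (8000 - 1000) / 20 = 350 retained posterior samples
```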

The statistical quality of the parameter estimates derived from the MCMC samples increases with the number of MCMC iterations. Figure 13.2 depicts the standard error of the posterior mean estimate of one of SPM’s parameters as a function of the number of iterations. Decreasing the standard error by a factor of 10 requires 100 times as many samples. Thus, decreasing the standard error becomes increasingly expensive (the relationship is quadratic). Interestingly, because MCMC sampling is linear in time, we see a direct relationship between computation time (measured in MCMC iterations) and the statistical quality of the parameter estimates. This relationship between estimation accuracy and computational time is a commonly recurring theme in computational modelling.
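The scaling behind Fig. 13.2 is the familiar Monte Carlo standard-error relationship; treating the retained draws as approximately independent (which the thinning described above encourages), the argument is:

```latex
\operatorname{SE}(\hat{\theta}) \approx \frac{\sigma}{\sqrt{N}}
\quad\Longrightarrow\quad
\frac{\operatorname{SE}_2}{\operatorname{SE}_1} = \frac{1}{10}
\ \text{ requires }\
\frac{N_2}{N_1} = 10^{2} = 100 .
```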

Fig. 13.2  Standard error of one posterior mean estimate over the number of MCMC iterations

Employing an MCMC approach for model training is computationally expensive in its own right (in our case, 8,000 MCMC iterations were required for acceptable accuracy). The computational cost is further increased by using a technique called ‘leave-one-out cross validation’ to evaluate SPM’s predictive performance. That is, for each visitor, we trained SPM with a reduced dataset containing the data of 157 of our 158 visit trajectories (following the above-mentioned sampling procedure), and used the withheld visitor pathway for testing. Owing to these factors, evaluating one SPM variant with our dataset requires approximately 6 days of compute time on a modern commodity-cluster server such as the East Enterprise Grid (http://www.enterprisegrid.edu.au). This equates to over 22,000 CPU hours. Compounding this is the need to explore a collection of different SPM variants, nine over the course of this research, adding up to approximately 200,000 CPU hours for a modest-size research project.
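Reading the 6-day figure as the single-core run time of one model training (one left-out visitor), the quoted totals follow approximately as:

```latex
159 \text{ trainings} \times 6 \text{ days} \times 24\ \mathrm{h/day}
  \approx 22{,}900 \ \text{CPU-hours per SPM variant},
\qquad
9 \text{ variants} \times 22{,}900 \approx 2 \times 10^{5} \ \text{CPU-hours}.
```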

13.4.3  The Experiment

The full-quality (8,000 MCMC iterations) experiments were run over a few months using Nimrod/G to distribute thousands of model trainings across university-level computational grid resources that were available on an opportunistic basis. However, relying on opportunistic resources makes it difficult to provide a quality of service (QoS) guarantee on the run-time of the experiment. Hence, if only limited resources are available in the presence of a deadline, one might be forced to reduce the run-time of the jobs, jeopardising the estimation quality. Alternatively, the Nimrod EC2 extension allows the scientist to expand their resource pool for a fee rather than compromising the quality of their research in such a situation.


Table 13.3  Computational resources used for the experiment

Resource    No. of cores    Compute/core                 Memory/core    Cost/core
East        120             1.6 GHz Intel Xeon E5310     1 GB           N/A
EC2 (a)     152             2.5 EC2 Compute Units        0.875 GB       US$0.10
EPC         4               Unknown                      0.5 GB         N/A

(a) We used c1.xlarge type instances with AMI-0dda3964, hosted in the US-East EC2 region

To demonstrate the EC2 extension and the applicability of Nimrod’s existing economic scheduling capabilities, we ran a smaller version of the full sweep experiment – 9 SPMs × (158 reduced datasets + 1 complete dataset), totalling 1,431 distinct tasks – by decreasing the number of MCMC iterations from 8,000 to 80 (a reduction in the computational requirements by a factor of 100).

A back-of-envelope analysis using the approximate 6-day per-job run-time on our East cluster from the full-quality experiments (scaled down 100 times) suggests that, with East alone, this experiment would take at least 17.2 h. This is an underestimate, because the scaling also divides the input/output file-copy time and start-up overhead by 100, whereas these overheads are the same for the full and scaled experiments.
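The arithmetic behind this estimate, taking the scaled per-job run time as one-hundredth of the 6-day figure and East's 120 cores from Table 13.3, is roughly:

```latex
9 \times (158 + 1) = 1{,}431 \text{ jobs},
\qquad
t_{\text{job}} \approx \frac{6 \times 24\ \text{h}}{100} = 1.44\ \text{h},
\qquad
\frac{1{,}431 \times 1.44\ \text{h}}{120\ \text{cores}} \approx 17.2\ \text{h}.
```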

We enacted a scenario requiring results overnight (a 12-h deadline) with a limited set of free resources (listed in Table 13.3). The free resources alone are incapable of meeting the projected throughput requirement, so we added EC2 with a US$100 budget and selected a cost-minimisation strategy in the scheduler (as discussed in Section 13.2.3). Our free resources included the Eucalyptus Public Cloud (EPC) (http://open.eucalyptus.com/wiki/EucalyptusPublicCloud), so that we could demonstrate Eucalyptus compatibility and the private-cloud tunnelling mechanism (the EPC does not allow outgoing connections from machine instances). The EPC is simply a public demonstration system for testing Eucalyptus; it provides no service guarantees and is not intended for computation. Users can run up to four instances concurrently, and a 6-h maximum instance up-time is enforced.

13.4.4  Computational and Economic Results

The number of jobs in execution on each resource over time is shown in Fig. 13.3. All 1,431 jobs were completed successfully within the deadline and budget, in 11 h and 6 min, having spent US$68.35 according to the economic scheduler. The scheduler quickly filled the available cores on East and gradually scheduled to EC2, adding more jobs to EC2 once East was fully allocated. The delay between the jobs starting on EC2 and on East represents the EC2 instance-provisioning wait-time; no queue wait-time was experienced on East. Before the halfway point, the scheduler began to estimate that East was capable of finishing the remaining jobs on its own, and because of the cost-minimisation bias, EC2 usage was quickly reduced to nothing despite EC2 having the highest throughput. The EPC, being a low-performance demonstration system, never managed to complete a job before the 6-h instance up-time limit was reached. The EC2 actuator recovered from this and restarted the proxy on a newly launched instance; however, the jobs were rescheduled to East.


Fig. 13.3  Jobs executing over time (task parallelism: number of running jobs on East, EC2 and EPC against time elapsed in minutes)

Table 13.4  Completed job statistics

Resource    No. of jobs completed    Tot. job time (h:m:s)    Mean/SD job run time (min)
East        818                      1245:37:23               91.36 / 5.70
EC2         613                      683:34:05                66.91 / 14.23

Note: the EPC is omitted because it did not complete any jobs


Owing to the economic scheduler not being aware of the time-slice charging of EC2 instances, it underestimated the real cost by approximately US$20. It also does not currently consider data-transfer charges (although transfer statistics are recorded), but these were negligible here. The underestimation may seem considerable; however, it is exacerbated in this case by the high level of instance parallelism and the relatively short instance run-times. The scheduler’s estimate would be much better if, for example, fewer instances were used for longer (the user can enforce this).
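To see why time-slice charging produces a gap of this kind, consider the following sketch. It is illustrative only, not the scheduler's accounting code: the per-core price and core count come from Table 13.3, it assumes EC2 bills each instance per started hour, and the instance count and up-times are assumptions chosen to resemble this run (19 c1.xlarge instances, each up for a little under 6 h).

```python
import math

CORE_HOUR_PRICE = 0.10          # Table 13.3: US$ per core-hour
CORES_PER_INSTANCE = 8          # c1.xlarge
INSTANCE_HOUR_PRICE = CORE_HOUR_PRICE * CORES_PER_INSTANCE  # US$0.80

def scheduler_estimate(core_hours_used):
    """What the economic scheduler charges for: CPU time actually consumed by jobs."""
    return core_hours_used * CORE_HOUR_PRICE

def ec2_bill(instance_wall_hours):
    """What EC2 charges for: each instance's up-time, rounded up to whole hours."""
    return sum(math.ceil(h) for h in instance_wall_hours) * INSTANCE_HOUR_PRICE

# Illustrative figures: 19 instances (152 cores / 8), each up ~5.7 h,
# consuming ~683 core-hours of job time in total (Table 13.4).
estimate = scheduler_estimate(683.6)   # ~US$68, as reported
bill = ec2_bill([5.7] * 19)            # 19 x 6 whole hours x US$0.80 = US$91.20
print(f"estimate US${estimate:.2f}, bill US${bill:.2f}, gap US${bill - estimate:.2f}")
```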

Table 13.4 shows run-time statistics for the completed jobs, which provide insight into the performance of EC2. EC2 completed jobs in 74% of the time that East took (about 25 min faster on average). This is slightly worse than the EC2 Compute Unit rating suggests (a direct comparison between East and EC2 is possible because East has the same vintage of Xeon CPUs as those used in the EC2 rating). Each EC2 core on a c1.xlarge instance should provide roughly 2.5 GHz compared with East’s 1.6 GHz; therefore, we would expect EC2 to take approximately 64% of the time of East. Of particular note is the higher run-time standard deviation on EC2. A possible explanation for this is that the physical hardware may have been shared with other users’ virtual machine instances.
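In other words, using the mean run times from Table 13.4 and the clock rates quoted above:

```latex
\text{expected ratio} = \frac{1.6\ \text{GHz}}{2.5\ \text{GHz}} = 0.64,
\qquad
\text{observed ratio} = \frac{66.91\ \text{min}}{91.36\ \text{min}} \approx 0.73,
```

i.e. the observed ratio is close to, but slightly above, the expected one, matching the roughly three-quarters figure quoted in the text.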
