

issues that arise in the implementation of our Amazon-specific adapters and some of the open challenges for Nimrod and similar tools.

13.2  High-Throughput Science with the Nimrod Tools

While computation is now widely used in scientific research, we frequently see studies that report on just a few simulations or the analysis of a small quantity of data. Such studies may be suggestive, but they are typically not robust in the sense of quantifying sensitivity to factors such as initial conditions, parameter choices and data used for parameter estimation.

The commodity parallel computing revolution promises to make such limitations unnecessary. Continued Moore’s Law growth in transistor counts in microprocessors, combined with physical limits on circuit size, is spurring the development of multi-core processors, which may be used alone or within larger multiprocessor systems to run large numbers of computational studies in parallel. Further, the emergence of commercial computing clouds means that researchers can access large amounts of computing power cheaply and quickly. Similarly, many fields that were once data poor now have access to multi-terabyte datasets, with commodity parallel disk arrays providing for low-cost storage and commodity parallel computers enabling rapid analysis.

While the availability of suitable commodity hardware is pushing high-throughput computing (HTC) into the realms of everyday science, such science would not be possible without the considerable tool support necessary to effectively leverage and orchestrate the data and processing resources. To gain the throughput necessary to obtain results in a timely fashion, it is often necessary to use multiple distributed resources, which comprise varied hardware, and typically run different software stacks. In some cases, the computational effort required dwarfs the resources provided by a researcher’s home institution and/or state and national initiatives, indicating that the researcher must either source the capacity elsewhere, compromise on accuracy or scope, or possibly abort their plans.

Resources for high-throughput science are typically commodity clusters managed by batch queuing systems or idle-cycle harvesting pools (e.g. Condor [3] pools), made available remotely through grid middleware interfaces such as Globus [4], UNICORE [5] and gLite [6]. These middleware stacks, and the development efforts around them, have focused on exposing and standardising the task/job and data-oriented services typical of HTC and HPC workloads. There are a number of successful production grid initiatives operating worldwide, such as OSG, EGEE, TeraGrid and the PRAGMA grid. However, grid computing has not seen widespread adoption outside scientific HTC, most likely because it has been tailored specifically to that application domain. There also remain significant technological barriers that slow adoption, such as interoperability [7, 8] and application deployment [9].

The low cost, abundance and increasing performance of virtualisation technology, which is already being exploited to consolidate computing infrastructure, promise to ease the deployment problem and to promote novel solutions to it. This will also have a positive influence on interoperability between systems and HTC applications by allowing the same software stack, from operating system to scientific application, to be built, hosted and run on one infrastructure, and then transferred with relative ease to another, whether that is a grid or a cloud system.

13.2.1  The Nimrod Tool Family

Over the past 15 years, we have built, maintained and improved the Nimrod tool family. These tools automate parameter sweeps and searches using distributed computing resources. A user typically provides Nimrod with a plan file that describes the parameters and their values, and how to execute the application. Plan files are declarative and deliberately similar to the job scripts used by batch queuing systems; however, they also expose file-transfer and parameter-substitution functionality.

Users specify the input files to copy to the computational node, the tasks necessary to execute the application for a single parameter combination, and the output files to copy back. Thus, the task syntax used in the plan file is intuitive, because it is declarative and mimics how a user might run the application on their machine or a local cluster.
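As an illustration, the following sketch shows the general shape of a small plan file for a two-parameter sweep. The parameter names, ranges, file names and executable are invented for this example, and the exact keywords may differ between Nimrod versions:

    parameter temp float range from 280 to 320 step 10
    parameter conc float range from 0.1 to 0.5 step 0.1

    task main
        copy model.exe node:.
        copy input.dat node:.
        node:execute ./model.exe input.dat $temp $conc
        copy node:output.dat results/output.$jobname
    endtask

Each combination of parameter values becomes a job: Nimrod stages the named input files to the compute node, substitutes the parameter values into the task commands, runs the task, and copies the output file back.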

Using Nimrod significantly decreases the effort required to scale up the level of parallelism in a computational experiment. Users can add computational resources and associated credentials to Nimrod and choose any combination of them to create a logical high-throughput engine for each experiment. In this way, Nimrod provides meta-scheduling functionality by distributing jobs across multiple underlying resource schedulers.

The Nimrod tools have been used successfully across a wide range of high-throughput science, with recent work in fields such as molecular biology [10], cardiology [11], chemistry [12] and climatology [13]. We actively pursue collaborations with specialists who have challenging and novel applications for parametric distributed computing.

Table 13.1 lists the major, actively developed components of Nimrod. When we refer to Nimrod services, or just 'Nimrod', without qualifying a particular variant or group, we are referring to the Nimrod/G components.

13.2.2  Nimrod and the Grid

Nimrod targets different types of computational resources, ranging from local batch schedulers to distributed Condor [3] pools and Globus-enabled [4] grid resources. The latter leverages Globus functions that support remote job execution, file transport, security and resource discovery.


Table 13.1 Components of the Nimrod tool family (non-exhaustive)

Nimrod/G [1]
  Provides distributed parameter sweep and single-task execution via grid and cloud mechanisms, plus economic and deadline scheduling of jobs across multiple compute resources. Importantly, Nimrod/G operates either as a tool (usually via a web portal) or as a middleware layer in its own right, serving as a job management system for other software, including the other members of the Nimrod family.

Nimrod/O [14] (utilises Nimrod/G (a))
  Supports design optimisation rather than complete enumeration. Computational models are treated as functions that accept input parameters and return an objective cost value. Nimrod/O incorporates a number of different search heuristics, ranging from gradient descent to genetic algorithms. Used in conjunction with Nimrod/G, it can exploit parallelism in the search algorithm.

Nimrod/E [15] (utilises Nimrod/G (a))
  Provides experimental design techniques (e.g. fractional factorial analysis) for analysing parameter effects on an application's output. The outcome is a Nimrod/G-style sweep that explores only those parameter combinations likely to influence the experiment's results, reducing the number of runs required to achieve useful scientific outcomes.

Nimrod/K [16] (utilises Nimrod/G (a), Kepler)
  Integrates the above Nimrod tools into the Kepler workflow engine; along with a novel dataflow mechanism, this provides dynamic parallelism for Kepler workflows.

(a) The other tools utilise Nimrod/G as a distributed computing middleware, but can also operate independently by using the local machine as a compute resource.

Nimrod’s Globus support began prior to the release of the widely adopted pre-web services Globus Toolkit 2, and has continued with more recent releases of the web services-based Globus Toolkit 4. As a result, it supports resources using both variants of the toolkit through the globus or gt4 actuator.

Recently, we used Nimrod/G to run a large experiment in protein crystallography, made particularly significant by its use of over 20 high-end clusters from several grids worldwide to provide the half-a-million CPU hours required for the experiment within 2 months [7]. Cloud computing has the potential to significantly increase throughput for such science, while decreasing the human effort involved in coordinating interoperability and deployment between resources.

13.2.3  Scheduling in Nimrod

Nimrod supports a pluggable scheduling architecture that allows it to use a range of different scheduling techniques. The simplest, a first-come-first-served approach, places jobs on resources in order to maximise throughput. This default approach allows a user to leverage as many resources as possible. A range of schedulers also support a computational economy in which resource providers charge, and users pay, for service. This allows a user to express the importance of their experiment in terms of a deadline, combined with a computational budget. Nimrod/G pioneered this approach in 2000 [1] when no such infrastructure existed.
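A minimal sketch of the idea behind the default policy is shown below (Python, purely for illustration; this is not the Nimrod/G implementation): pending jobs are handed out in arrival order to whichever configured resource currently has spare capacity, so that all resources are kept busy.

    def assign_fcfs(job_ids, free_slots):
        """Toy first-come-first-served placement across several resources.

        job_ids    -- jobs in arrival order
        free_slots -- dict mapping resource name to currently free slots
        Returns a dict mapping each placed job to a resource; jobs that do
        not fit wait until a completion frees a slot.
        """
        placement = {}
        slots = dict(free_slots)
        for job in job_ids:
            # pick the resource with the most spare capacity right now
            resource = max(slots, key=slots.get)
            if slots[resource] == 0:
                break  # everything is full; remaining jobs stay queued
            placement[job] = resource
            slots[resource] -= 1
        return placement

    # e.g. assign_fcfs(range(6), {"cluster-a": 4, "ec2-pool": 2})
    # -> four jobs placed on cluster-a and two on ec2-pool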

Originally, the idea of a computational economy was to provide a common language in which different users could compare their resource requests. Within a finite economy, users who were prepared to expend more of their grid dollar (G$) budget were more likely to complete computations within their deadlines. This approach was expanded into an architecture in which users paid for services, and service providers charged [17].

Commercial clouds now form the first publicly accessible computational economy, making economic computational and data scheduling especially significant and topical. In commercial clouds, service providers charge ‘real’ money based on the cost of provision. Importantly, in this work, we have merged these two different uses of currency, and have leveraged the earlier work in a computational economy to embrace commercial clouds.

The existing job scheduler was designed for space-shared, batch-queued systems, as is typical on a computational grid. It was envisaged that these resources would charge for some absolute, atomistic measure of computing consumed (e.g. MIPS), rather than per time slice as is the case with EC2. This means that the scheduler will underestimate the budget used and will not recognise time that has already been purchased. However, as we show in Section 4, the current implementation is still applicable; implementing a time-slice-aware scheduler is a subject for future work.
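The mismatch can be illustrated with some simple arithmetic; the hourly rate and job times below are invented for the example:

    import math

    HOURLY_RATE = 0.10   # assumed $/instance-hour, EC2-style time-slice billing
    JOB_MINUTES = 10     # observed run time of a single job
    JOBS = 25

    # What a scheduler assuming "pay only for what you consume" predicts:
    consumption_cost = JOBS * (JOB_MINUTES / 60) * HOURLY_RATE        # $0.42

    # What time-slice billing actually charges if each job naively gets
    # its own instance (every started hour is charged in full):
    naive_cost = JOBS * math.ceil(JOB_MINUTES / 60) * HOURLY_RATE     # $2.50

    # Re-using hours already purchased by packing jobs back-to-back
    # onto the same instances:
    packed_cost = math.ceil(JOBS * JOB_MINUTES / 60) * HOURLY_RATE    # $0.50

    print(consumption_cost, naive_cost, packed_cost)

The gap between the first two figures is the underestimate described above; the third shows why a time-slice-aware scheduler should keep feeding jobs to instances whose hour has already been paid for.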

Because the Nimrod tools specialise in parameter-study applications, the job scheduler is able to make reasonable assumptions about job execution times, resource performance and job throughput. Many modelling applications have low variance in their processing requirements across parameter sets (e.g. the case study in Section 4), though there are certainly exceptions, such as the case study in [7]. Nimrod's economic and deadline-scheduling algorithms exploit this property of the workload to provide soft deadline and budget guarantees. Much theoretical and practical work has been devoted to scheduling, with wildly varying approaches: some strive to meet hard deadlines on inherently unreliable distributed infrastructure by using task-replication algorithms [18]; others mandate an omniscient super-scheduler; some assume historical data to predict non-deterministic events; and still others employ statistical inference and machine learning to predict and adjust reliability [19].

Nimrod takes a practical, adaptive approach that requires no extra information or services. This is important because, in our experience, users often have little idea of the computational requirements of their models across varying hardware or inputs. Moreover, for the typical workload (with low variation in job run time and an order of magnitude more jobs than parallel processing units), this approach produces results very close to optimal; for the typical user, near enough is good enough.
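The essence of this adaptive estimate can be sketched as follows (Python, for illustration only; the real Nimrod/G scheduler is more involved): observed run times of completed jobs stand in for the unknown requirements of the remaining ones, which is reasonable precisely when run-time variance across parameter sets is low.

    import math
    from statistics import mean

    def processors_needed(completed_runtimes_s, jobs_remaining, seconds_to_deadline):
        """Estimate how many processors must run in parallel to finish the
        remaining jobs before the deadline, using only the mean run time of
        jobs completed so far. A sketch of the idea, not Nimrod/G code."""
        if not completed_runtimes_s or seconds_to_deadline <= 0:
            return None  # nothing observed yet, or the deadline has passed
        work_left = mean(completed_runtimes_s) * jobs_remaining   # CPU-seconds
        return math.ceil(work_left / seconds_to_deadline)

    # Example: 12 finished jobs averaging 30 minutes, 900 jobs left, 3 days to go
    # processors_needed([1800] * 12, 900, 3 * 24 * 3600)  -> 7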
