
13  Mixing Grids and Clouds: High-Throughput Science Using the Nimrod Tool Family


Table 13.2  Adaptive scheduling algorithms in Nimrod/G

Scheduling strategy   Execution time (not beyond deadline)   Execution cost (not beyond budget)
Time                  Minimise                               Limited by budget
Time optimal          Minimise                               Unlimited budget
Cost                  Limited by deadline                    Minimise
None                  Limited by deadline                    Limited by budget

Nimrod implements four different adaptive scheduling strategies (listed in Table 13.2) [20] that attempt to meet deadline and/or budget constraints, possibly while minimising execution time or cost. Over the duration of the experiment, the scheduler refines a profile for each computational resource and alters the job allocation accordingly.

As commercial clouds enable researchers to easily expand their computational test-beds beyond the confines of university-level infrastructure, it is likely that they will wish to do so while continuing to utilise their local resources and minimising expenditure on cloud time. For this reason, the experiment we present in Section 4 uses the cost-minimisation strategy.
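To make the cost-minimisation idea concrete, the following is a minimal Python sketch of a deadline-constrained, cost-minimising assignment loop. It is an illustration of the strategy described above, not the actual Nimrod/G scheduler; resource attributes such as cost_per_job and avg_job_time stand in for the profile the scheduler refines at run time.

    # Simplified cost-minimisation sketch (illustrative, not Nimrod/G code):
    # assign each job to the cheapest resource whose estimated completion
    # time for its assigned work stays within the experiment deadline.
    def schedule(jobs, resources, deadline_seconds):
        assignment = {r.name: [] for r in resources}
        for job in jobs:
            for r in sorted(resources, key=lambda r: r.cost_per_job):
                est_finish = (len(assignment[r.name]) + 1) * r.avg_job_time
                if est_finish <= deadline_seconds:
                    assignment[r.name].append(job)
                    break
            # If no resource can meet the deadline, the job is left unassigned
            # and would be reconsidered as the resource profiles are refined.
        return assignment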

13.3  Extensions to Support Amazon’s Elastic Compute Cloud

Recently, we developed a new actuator and associated components capable of interfacing Nimrod with compute clouds offering EC2-compatible APIs. In addition to allowing Nimrod jobs to be run on EC2, the extension also supports Eucalyptus [21] and OpenNebula [22] clouds. In this section, we describe the extension and discuss the issues involved in writing applications for such clouds.

The EC2 service provides one of the most generic and low-level interfaces to Infrastructure as a Service (IaaS) utility computing. At its most basic, it simply allows clients to start an instance of a particular virtual machine image on one of a handful of virtual hardware configurations. Further use of that instance is afforded via Secure Shell (SSH) access, which EC2 supports by providing a service for generating and managing SSH cryptographic key-pairs. Instances are then pre-configured with a key of the client’s choice when a virtual machine is booted.
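As a concrete illustration of this basic workflow, the following Python sketch uses the boto library (the same library the extension described in Section 3.2 builds on) to generate a key pair, start an instance of a machine image with that key, and wait for it to boot. The image identifier, key name and instance type are placeholders.

    import time
    import boto

    # Connect using AWS credentials from the environment or boto configuration.
    conn = boto.connect_ec2()

    # EC2 generates the key pair; the private key is saved locally for SSH use.
    key = conn.create_key_pair('nimrod-demo')
    key.save('.')

    # Start one instance of a (placeholder) machine image, pre-configured
    # with the chosen key, on a small virtual hardware configuration.
    reservation = conn.run_instances('ami-12345678',
                                     key_name='nimrod-demo',
                                     instance_type='m1.small')
    instance = reservation.instances[0]
    while instance.update() != 'running':
        time.sleep(5)
    print('SSH target:', instance.public_dns_name)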

In contrast to IaaS, grid middleware typically provides a Platform as a Service (PaaS) interface built around some representation of a computational job – usually an invocation of a program, optionally staged onto the remote machine as part of the job, and potentially specifying a number of options relevant to the local resource manager (LRM) or batch queuing system.

Developing or adapting an application to use EC2 can be challenging, and often requires writing code for tasks peripheral to the main purpose of the application, such as machine provisioning and management. Grid clients, such as Nimrod, typically deal with middleware interfaces at the level of job management and file-transfer services without concern for the lifecycle of the machine on which the jobs might run. Typically, Nimrod only needs to know (or discover) the resource type and contact details, along with the architecture of the platform, in order to use it as a computational resource. No explicit management of the underlying computational resource is normally required. Hence, management of the virtual infrastructure is the largest part of the extension.

Rather than using a higher-level IaaS or PaaS built on top of EC2, we opted to use the basic core interface – the EC2 web service for provisioning and SSH for interaction. This means that the EC2 extension can accommodate a broad range of uses and, importantly for deployment, can utilise almost any virtual machine image suitable for running within the cloud. It is important to note that the extension provides an EC2 cloud execution mechanism for Nimrod, and that many of the higher-level AWS services (such as Elastic Load Balancer) are not applicable because Nimrod is not itself a cloud-hosted service; we are simply interested in utilising the compute capacity.

Defining a Nimrod EC2 resource requires a label, the service URL, the access- and secret-key file locations, a machine image identifier (and, optionally, kernel and initial-ramdisk image identifiers), an instance type, and a limit on the number of instances to run in parallel. There are further options, such as whether a proxy or tunnelling is required, and most options have default settings suitable for use with EC2.
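For illustration, the parameters listed above might be collected as in the following Python sketch. This is not Nimrod's actual configuration syntax, and all values are placeholders.

    # Hypothetical illustration of an EC2 resource definition (not Nimrod syntax).
    ec2_resource = {
        "label": "ec2-demo",
        "service_url": "https://ec2.amazonaws.com",
        "access_key_file": "~/.ec2/access_key",
        "secret_key_file": "~/.ec2/secret_key",
        "image_id": "ami-12345678",     # machine image
        "kernel_id": None,              # optional kernel image
        "ramdisk_id": None,             # optional initial ramdisk
        "instance_type": "m1.small",
        "max_instances": 8,             # limit on instances run in parallel
    }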

13.3.1  The Nimrod Architecture

This section describes the architectural details of Nimrod relevant to the EC2 extension. Nimrod utilises a modular architecture that clearly separates the responsibility for various processes to a number of extensible modules. The modules are coordinated via a data model using a relational database management system (RDBMS), which also provides the basis for persistence and failure recovery. A particularly important feature of Nimrod is its use of a remote agent that runs on the computational node. The agent retrieves work from the root server (the server where the other Nimrod modules are running) and will, in a single execution, process as many jobs as available during its allotted time. This contrasts with the usual approach of submitting each job to the middleware separately, and helps mitigate the effects of unpredictable queue wait time on overall execution time. Nimrod, along with Condor [3] (which uses the glide-in mechanism), was one of the pioneering systems to use such a technique. Recent specialised high-throughput systems, such as Falkon [23], follow a similar approach. This approach is also well-matched to existing cloud services, as will be discussed in Section 3.2.
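The pull model used by the agent can be sketched as follows. This is an illustration of the behaviour described above rather than the agent's actual implementation; root.fetch_next_job and root.report_result are hypothetical stand-ins for the agent's communication with the root server.

    import time
    import subprocess

    def run_agent(root, allotted_seconds):
        # Keep pulling jobs from the root until time runs out or no work remains.
        deadline = time.time() + allotted_seconds
        while time.time() < deadline:
            job = root.fetch_next_job()          # hypothetical root API
            if job is None:
                break                            # no jobs left for this experiment
            result = subprocess.run(job.command, shell=True, capture_output=True)
            root.report_result(job.id, result.returncode, result.stdout)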

The agent is highly portable and can be built for several different architectures and operating systems. Importantly, there is no requirement to install any Nimrod components on the remote computational system prior to running an experiment, as the agent is staged in and launched using a variety of supported interfaces. This greatly simplifies system deployment and makes it possible to easily create an ad-hoc high-throughput engine by consolidating multiple computational resources.


Nimrod jobs correspond to executions of a Nimrod task for a particular parameter combination. As discussed in Section 2.1, the Nimrod tools use customised computational-task description syntax, similar to a batch script. Depending on the tool being used, all jobs for a particular experiment may be added to the database prior to execution or added dynamically, for example, by some iterative process (e.g. an optimisation/search). While Nimrod is typically used for parametric studies of a statically defined task, it also accommodates assigning a different task to every job and thus provides a high-level general-purpose computational middleware service.
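The relationship between a task, its parameters and the resulting jobs can be illustrated with a short Python sketch; the parameter names, values and command line are purely illustrative.

    import itertools

    # One statically defined task, swept over every parameter combination.
    parameters = {
        "x": [0.1, 0.2, 0.3],
        "y": [1, 2, 4, 8],
    }
    task = "./model.exe --x {x} --y {y}"

    jobs = [task.format(**dict(zip(parameters, combo)))
            for combo in itertools.product(*parameters.values())]
    # 3 x 4 = 12 jobs, each one execution of the task for one parameter combination.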

Most of the low-level machinery is concerned with interacting with computational resources and services in order to launch agents and ensure that those agents can contact the root – in some cases, this involves launching a proxy to bridge between private cluster networks and the root machine, which is often on the public internet. Actuators interact directly with external systems, such as middleware services like the Globus Resource Allocation Manager (GRAM), batch systems, or other meta-schedulers. Actuators (1) perform resource information discovery functions (e.g. determining machine architecture), (2) transfer agents and their prerequisite files (contact details for modules on the root and symmetric cryptography keys for authentication) to resources and (3) subsequently launch agents, either directly or by queuing batch jobs that launch them. Actuators can be considered resource-specific drivers for Nimrod, providing a uniform interface to various types of computational resource.
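The three actuator responsibilities above can be summarised as an abstract interface; the method names in this sketch are illustrative and do not reflect Nimrod's internal API.

    from abc import ABC, abstractmethod

    class Actuator(ABC):
        """Sketch of a resource-specific driver, as described above."""

        @abstractmethod
        def discover(self, resource):
            """Query the resource, e.g. to determine its machine architecture."""

        @abstractmethod
        def stage_agent(self, resource, agent_binary, credentials):
            """Transfer the agent and its prerequisite files (root contact
            details and symmetric keys) to the resource."""

        @abstractmethod
        def launch_agents(self, resource, count):
            """Start agents directly, or queue batch jobs that start them."""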

The agent schedulers decide what operations ought to be performed by the actuator for a given resource and experiment. They (1) trigger actions by the actuators in response to job-to-resource assignments made by job schedulers, (2) enforce resource or user-specified limits on agent submissions and (3) for some resource types, schedule peripheral tasks such as credential refreshes.

Job schedulers assign work to the available resources using a choice of in-built heuristics, including cost and time minimisation on a per-experiment basis. It is also possible to do one’s own job scheduling through a specialised API. The job scheduler takes into account dynamic job metadata, collected by the agents, and continually refines the schedule throughout execution.

The database and file servers provide agents with job and control data, and access to the experiment file system on the root, respectively. Computational resources exhibit a range of typical but differing network topologies with regard to internet connectivity. A thorough discussion is beyond the scope of this chapter; however, where agents cannot reach the root directly, Nimrod can launch the agent in proxy mode on one or more intermediate machines in order to give the agents the network access they need to connect to these servers.

13.3.2  The EC2 Actuator

Fig. 13.1  EC2 extension architecture

Figure 13.1 shows the architecture of the EC2 extension. The actuator provides the main control flow, an interface to events and data regarding resource configuration, and agent commands. Typically, Nimrod actuators interface with a library or call out to command-line client tools in order to start agents on remote resources. A large part of an actuator's implementation is devoted to adapting the interface offered by the external middleware or service to the interface required by the actuator model. In Nimrod, this functionality is encapsulated in a resource module. The new EC2 resource module leverages the Boto (http://code.google.com/p/boto/) library for communication with the EC2 web service. We chose Boto because, like Nimrod/G, it is implemented in Python, so we did not need to create a command-line wrapper for the AWS Java client tools.

Boto provides client implementations for many of the current AWS query APIs, including EC2, S3 and SQS. The EC2 query API has been adopted by a number of other IaaS cloud projects offering software for creating private and hybrid clouds; notable examples with support in current releases include Eucalyptus and OpenNebula. The EC2 actuator can therefore provision Amazon EC2, Eucalyptus and OpenNebula cloud resources for use in Nimrod experiments.
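Because these clouds expose the same query API, pointing boto at a private cloud front end only requires overriding the endpoint details. The following sketch shows this; the host name, port and path are placeholders for a local Eucalyptus-style installation.

    import boto
    from boto.ec2.regioninfo import RegionInfo

    # Describe the private cloud's EC2-compatible endpoint (placeholder values).
    region = RegionInfo(name='private-cloud', endpoint='cloud.example.edu')
    conn = boto.connect_ec2(aws_access_key_id='ACCESS_KEY',
                            aws_secret_access_key='SECRET_KEY',
                            is_secure=False,
                            region=region,
                            port=8773,
                            path='/services/Eucalyptus')

    # The same EC2 query calls now go to the private cloud.
    images = conn.get_all_images()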

As shown in Fig. 13.1, the EC2 resource divides instances into slots based on the number of processor cores per instance and the number of cores required per job. Agents are allocated to the slots and then launched via SSH once the instance has been initialised. Owing to limitations of the IPv4 address space, private clouds, like compute clusters, commonly use a network configuration with a reserved private IP address range; we postulate that this will become more common as cloud computing is adopted for HTC. In such configurations, external access is possible via port forwarding from an intermediate device.
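The slot arithmetic and SSH launch step described above amount to something like the following sketch; the instance and agent details are placeholders, and the real extension handles errors, proxies and key management omitted here.

    import subprocess

    def slots_per_instance(cores_per_instance, cores_per_job):
        # Each instance contributes one slot per group of cores a job needs.
        return max(cores_per_instance // cores_per_job, 1)

    def launch_agents(instances, cores_per_instance, cores_per_job,
                      key_file, agent_command):
        for inst in instances:
            for _ in range(slots_per_instance(cores_per_instance, cores_per_job)):
                # One agent per slot, started over SSH once the instance is up.
                subprocess.Popen([
                    'ssh', '-i', key_file, '-o', 'StrictHostKeyChecking=no',
                    'root@' + inst.public_dns_name, agent_command,
                ])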
