April 3, 2013

Deployment and Active Management of Hadoop in the Clouds

Paul Speciale, Appcara

The use of public clouds for both infrastructure and business application deployment has taken root in many businesses, but the cloud remains a relatively new delivery vehicle for large-scale, distributed big data applications such as Hadoop.

Certainly the ability to spin up cloud services for application development and testing is appealing. Clouds can eliminate the need to procure capital and avoid the time-consuming physical installation, configuration and management of the underlying infrastructure, while creating a dramatically more agile DevOps environment.

From simple web services to more sophisticated enterprise applications delivered via the SaaS model, CIOs and IT managers are already achieving tremendous cost and agility benefits from cloud-based application development and deployment. Having the same advantages for their most complex, and often costliest, big data applications would be nirvana.

In principle, cloud deployment of Hadoop makes sense on many levels: the availability of virtually unlimited resources for very large-scale Hadoop workloads, on-demand scaling and bursting, and usage-based billing all give big data apps appealing reasons to take advantage of the cloud.

Figure 1: Key advantages gained from running Hadoop in the cloud

In practice, however, at Hadoop scale these applications can become problematic to stand up, manage and maintain in cloud environments. Because Hadoop workloads can involve tens, hundreds or even thousands of servers, managing them effectively (making sure each instance in a cluster has the same configuration and dependencies) and maintaining them consistently when one or several must be changed, or new servers are added, can turn into a headache for DevOps teams.

Cloud service providers are beginning to offer big data services in their public clouds. With various offerings and price points to fit diverse enterprise needs across sectors, service providers are approaching the big data problem in very specific ways, offering cloud environments optimized to run these large-scale applications with maximum efficiency and performance:

  • Hadoop can be optimized through specific infrastructure capabilities; for example, Hadoop clusters often hit I/O bottlenecks against standard hard disk drives (HDDs).
  • Some service providers are solving this performance problem by offering high-performance SSD (solid state disk) storage in their clouds, in some cases at prices competitive with HDDs.
  • Cloud leader Amazon Web Services is, as usual, also addressing the opportunity, with High Memory Instances, High Memory Cluster Instances, Cluster GPU Instances and High I/O Instances that are better suited to HPC workloads.

The potential for big data services in the cloud is enormous: retailers, airlines and manufacturers tracking customer and product activity, and certainly large research institutions, would all have a big data-sized appetite for cloud resources. Service providers must clearly be seeing customer demand for running big data applications in public clouds.

The challenge, and the roadblock to Hadoop-in-the-cloud deployments, has less to do with service provider capabilities or processing power, and more to do with how IT can manage it all. Hadoop is a prime example of a highly distributed, complex application in need of more holistic management capabilities. Deploying and managing such large-scale applications is no trivial task. Given the number of elements in a large cluster, the manual work required to install, configure and manage these applications node by node can become overwhelming. For Hadoop clusters to run effectively in the cloud, we need tools that can provision, configure and automate the ongoing management of potentially thousands of nodes in a deterministic, consistent manner.

Within the past 12 to 18 months, tools that can effectively manage these complex, distributed applications have been arriving on the market, promising to further enable the cloud as an effective delivery vehicle for Hadoop.

One approach adopted for provisioning simpler cloud applications is to create a library of standardized Server Templates representing the various components; for Hadoop, this would mean standardized Templates for components such as Master and Slave Nodes. Each template specifies the server's required operating system, application packages, scripts and default configuration, automating the provisioning of that component. While this works well for provisioning the application as a set of components, it does not manage the application as a single holistic entity, nor does it handle the rapid change/add/update cycle these apps will likely incur. Moreover, a template-based approach tends to be static: any change to an underlying Hadoop component means changing the underlying Template, and likely a version management headache.
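The template approach, and its static-drift weakness, can be sketched in a few lines. This is a minimal illustration only; the template names, fields and configuration keys below are assumptions for the example, not any vendor's actual format.

```python
# Hypothetical server-template library for Hadoop components.
# All names, packages and config keys are illustrative.
HADOOP_TEMPLATES = {
    "hadoop-master": {
        "os": "centos-6.3",
        "packages": ["java-1.6.0-openjdk", "hadoop-1.0.4"],
        "scripts": ["install_hadoop.sh", "format_namenode.sh"],
        "config": {"dfs.replication": 3},
    },
    "hadoop-slave": {
        "os": "centos-6.3",
        "packages": ["java-1.6.0-openjdk", "hadoop-1.0.4"],
        "scripts": ["install_hadoop.sh", "start_datanode.sh"],
        "config": {"dfs.data.dir": "/data/hdfs"},
    },
}

def provision(role, count):
    """Stamp out `count` identical servers from one template."""
    template = HADOOP_TEMPLATES[role]
    # Each server is a one-time copy of the template. Editing the
    # template later does NOT reach servers already provisioned --
    # the static, version-drift problem described above.
    return [dict(template, name=f"{role}-{i}") for i in range(count)]

cluster = provision("hadoop-master", 1) + provision("hadoop-slave", 4)
```

Provisioning is easy and repeatable here, but note that `cluster` holds independent copies: a change to `HADOOP_TEMPLATES` afterwards leaves every running server untouched, which is exactly why ongoing management becomes a headache.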

As an alternative, the Platform-as-a-Service (PaaS) approach offered today for application development and as a toolkit for enterprise applications can be extended into an environment for designing, building and deploying big data apps in the cloud (BDaaS?). While this could streamline the build-deploy-manage cycle, as with other PaaS offerings it usually limits the available library of components to those integrated with and supported by the PaaS environment, and restricts the choice of clouds to the one provided by the PaaS vendor. This can limit the openness of the tool set and cloud services more than is desirable.

A different approach, used in next-generation cloud application platforms, is a dynamic, data-model driven design that captures the entire "Workload" definition, including all of its application components. In such systems, the underlying data model stores every property and element of the application, from package definitions and operating system details down to configuration parameters and, critically, the interdependencies between components (for example, network parameters and master-slave relationships). This works especially well for multi-component, distributed applications such as Hadoop, where it is beneficial to manage the entire cluster holistically.
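A data-model driven Workload definition might look something like the sketch below: components and the dependencies between them are stored explicitly, so the platform can resolve cross-component settings (such as the master's address) instead of an operator wiring them by hand. The class and field names are assumptions made for this illustration.

```python
# Illustrative "Workload" data model: components plus their
# interdependencies, resolved by the platform, not by hand.
class Component:
    def __init__(self, name, os, packages, config):
        self.name = name
        self.os = os
        self.packages = packages
        self.config = dict(config)

class Workload:
    """Holds every component and the dependencies between them."""
    def __init__(self, name):
        self.name = name
        self.components = {}
        self.dependencies = []  # (consumer, provider, attribute)

    def add(self, component):
        self.components[component.name] = component

    def depend(self, consumer, provider, attribute):
        # e.g. every slave needs the master's filesystem address
        self.dependencies.append((consumer, provider, attribute))

    def resolve(self):
        # Push each provider's attribute into its consumers' config,
        # so interdependent settings stay consistent cluster-wide.
        for consumer, provider, attribute in self.dependencies:
            value = self.components[provider].config[attribute]
            self.components[consumer].config[attribute] = value

wl = Workload("hadoop-cluster")
wl.add(Component("master", "centos-6.3", ["hadoop-1.0.4"],
                 {"fs.default.name": "hdfs://master:9000"}))
wl.add(Component("slave", "centos-6.3", ["hadoop-1.0.4"], {}))
wl.depend("slave", "master", "fs.default.name")
wl.resolve()
```

Because the master-slave relationship lives in the model, re-running `resolve()` after any change keeps every dependent component's configuration in step.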

Figure 2: Data-model driven management of big data apps in the cloud

By maintaining this application data model in an active repository, changes can be captured in real time by updating the repository, ensuring that all elements stay consistent and that any change is automatically reflected across all instances of the underlying Workload. This shifts application management from static (one-time provisioning) to dynamic, enabling automation of ongoing tasks such as configuration changes and dependency management, and relieving the application operator of manually propagating changes across a potentially huge number of independent elements. Moreover, this approach can be completely cloud agnostic: it captures the application-layer information and decouples it from underlying cloud dependencies, creating the freedom to deploy large-scale Hadoop implementations in any cloud, or across multiple clouds.

Big data applications such as Hadoop are poised to take advantage of scalable, on-demand cloud services sooner rather than later. To enable this, the ability to manage large-scale distributed applications as a single entity, with simplified interfaces for provisioning, management and lifecycle tasks, is incredibly appealing, both for reducing the time and errors spent on low-level tasks and for dramatically improving the productivity of big data apps.

Paul Speciale is chief marketing officer for Appcara, provider of a model-based cloud application platform. He has over 20 years of experience helping cloud, storage and data management technology companies, as well as cloud service providers, address the rapidly expanding Infrastructure-as-a-Service and big data sectors.


Related Items:

Baldeschwieler: Looking at the Future of Hadoop

MapR Turns to Ubuntu in Bid to Increase Footprint

Sharing Infrastructure: Can Hadoop Play Well With Others?