April 18, 2014

Nine Criteria to Achieve Hadoop-as-a-Service Happiness

Raymie Stata

The big data industry has made a compelling case for Hadoop as the core platform for a big data strategy. But deploying and maintaining a well-run Hadoop environment is challenging, to say the least. This difficulty has driven the Hadoop-as-a-Service (HaaS) market. However, not all HaaS offerings are the same. Here is a look at the different types of HaaS available today, as well as the nine key criteria that will make your HaaS evaluation a success.

Early adopters running Hadoop on their premises discovered that the distributed nature of Hadoop, its unusual infrastructure requirements, and the bursty nature of its workloads made it both difficult and expensive to maintain such an environment. Over time, the IT industry has evolved increasingly capable ways of using the cloud to address these problems.

Early in-house deployments of Hadoop clusters were constrained by the number of servers available in the cluster. Bursty workloads caused clusters to be simultaneously over-provisioned for the “steady state” and short on resources for burst loads, which in a massively parallel environment can have a crippling performance impact. Using cloud infrastructure-as-a-service (IaaS) offerings, businesses could reconfigure environments as new jobs, new data, or new requirements hit the cluster. The elastic nature of cloud computing was a natural fit for these problems.

Then IaaS providers started offering integrated, tested Hadoop distributions alongside their more conventional IaaS solutions. Services like Amazon’s Elastic MapReduce (EMR) reduce the know-how required to allocate appropriate IaaS resources and deploy Hadoop software onto those resources. And by providing an environment of tested and integrated Hadoop ecosystem components, these solutions alleviate some of the administrative tasks associated with Hadoop, such as patching software and determining initial settings.

However, while providing a more packaged Hadoop solution, these environments still assume the customer will “run it yourself” (RIY). RIY offerings still require substantial Hadoop skills to configure, tune, and operate Hadoop. Bursty workloads can be addressed through the elastic nature of the underlying IaaS, but utilizing those resources requires manual intervention.

Finally, in a major advance over RIY offerings, some vendors have begun to provide Hadoop itself as a service.  These “pure-play” HaaS offerings are similar to other SaaS offerings, such as Salesforce.com, where the end-user uses but does not operate the underlying software.

Pure-play Hadoop services offer significant advantages over IaaS and RIY environments.  These offerings do not require reconfiguration as data sizes grow and contract. Since the Hadoop environment is fully managed for users, there is no need to develop deep in-house expertise in Hadoop management. There are performance advantages to pure-play offerings as well. Since all data is stored natively in Hadoop, it is immediately and always available for production and analysis jobs.

Nine Criteria For Selecting a HaaS Solution

With an understanding of how HaaS has evolved, how does one now choose the best-matched offering? The following are nine service criteria to assess when choosing a Hadoop-as-a-Service solution.

Criteria 1: Expressive for users, simple for administrators

The service essence of a HaaS offering is the set of essential characteristics that define what the customer experiences. It must meet the needs of both data scientists and Hadoop administrators.

Data scientists typically want a functionally rich and powerful environment. A HaaS should allow data scientists to easily run Hadoop YARN jobs through Hive, Pig, R, Mahout, and other data science tools. These services should be immediately available when the data scientist logs into the service to begin work. This type of “always on” Hadoop service avoids the frustrating delays that arise when one must first deploy a cluster and load data from non-HDFS data stores before starting.
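To make the “always on” experience concrete, the sketch below shows a query submitted from a data scientist’s own environment against a HiveServer2 endpoint exposed by the service. It assumes the PyHive library; the host name and clickstream table are hypothetical.

# A minimal sketch of the "always on" experience, assuming the PyHive
# library and a HiveServer2 endpoint exposed by the service; the host
# name and the clickstream table are hypothetical.
from pyhive import hive

conn = hive.connect(host="hive.example-haas.com", port=10000, username="analyst")
cursor = conn.cursor()

# No cluster to provision and no data to stage: the table already lives
# in the service's HDFS, so the query can run immediately.
cursor.execute("""
    SELECT page, COUNT(*) AS view_count
    FROM clickstream
    WHERE dt = '2014-04-18'
    GROUP BY page
    ORDER BY view_count DESC
    LIMIT 10
""")
for page, view_count in cursor.fetchall():
    print(page, view_count)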

While data scientists should have a wide range of options and tools for working with Hadoop, for systems administrators, less is more. Their job typically entails a well-defined set of management tasks, and the interface should be streamlined to let them perform those tasks quickly and in a minimal number of steps.

Criteria 2: Data at Rest is Stored in HDFS

Users should not have to manage data in storage systems that are not native to Hadoop, or be required to move data into and out of HDFS as they work. HDFS is industry-tested to provide cost-effective, reliable storage at scale. It is optimized to work efficiently with MapReduce and YARN-based applications, is well suited to interactive use by analysts and data scientists, and is compatible with Hadoop’s growing ecosystem of third-party applications. HaaS solutions should offer “always on” HDFS so users can easily leverage these advantages.
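As a small illustration of what always-on HDFS buys, the sketch below reads metadata for data already at rest in the service, with no staging copy beforehand. It assumes the pyarrow library with libhdfs available; the namenode host and warehouse path are hypothetical.

# A minimal sketch of working directly against always-on HDFS, assuming the
# pyarrow library with libhdfs available; the namenode host and the warehouse
# path are hypothetical.
from pyarrow import fs

hdfs = fs.HadoopFileSystem("namenode.example-haas.com", port=8020)

# The data is already at rest in HDFS, so there is no copy-in step before
# analysis and no copy-out step afterwards.
listing = hdfs.get_file_info(fs.FileSelector("/warehouse/clickstream/dt=2014-04-18"))
for entry in listing:
    print(entry.path, entry.size)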

Criteria 3: Hadoop is Self-configuring

HaaS solutions should dynamically configure the optimal number and type of nodes and automatically determine tuning parameters based on the type of workload and the storage required. The optimized environments these services provide dramatically reduce human error, cut administrative time, and deliver results faster than customer-tuned environments.
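The sketch below is a toy version of the sizing decision such a service makes on the user’s behalf. The node profiles, headroom factor, and thresholds are illustrative assumptions, not any provider’s actual algorithm, though yarn.nodemanager.resource.memory-mb is a real YARN setting.

# A toy sketch of the sizing decision a self-configuring HaaS makes for the
# user; node profiles, headroom factor, and thresholds are illustrative
# assumptions (yarn.nodemanager.resource.memory-mb is a real YARN setting).
def size_cluster(input_gb, workload="batch", replication=3):
    # Storage: raw data, times replication, plus headroom for intermediate output.
    hdfs_gb_needed = input_gb * replication * 1.5

    # Compute: denser disks for batch work, more memory for interactive SQL.
    profiles = {
        "batch":       {"disk_gb": 4000, "vcores": 16, "ram_gb": 64},
        "interactive": {"disk_gb": 2000, "vcores": 16, "ram_gb": 128},
    }
    node = profiles[workload]

    node_count = max(3, int(-(-hdfs_gb_needed // node["disk_gb"])))  # ceiling division
    return {
        "node_profile": node,
        "node_count": node_count,
        "yarn.nodemanager.resource.memory-mb": node["ram_gb"] * 1024 - 8192,
    }

print(size_cluster(input_gb=5000, workload="interactive"))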

Criteria 4: Elasticity

Elasticity should be a central consideration when evaluating HaaS providers. In particular, one should pay attention to the degree to which the service handles changing demands for compute and storage without manual intervention.

In addition, one should consider multiple dimensions of elasticity. On the storage side, does the service automatically expand and contract HDFS capacity as new data is added and old data is deleted? Does HDFS automatically accommodate the intermediate outputs of jobs, which can be substantial at times? On the compute side, does the solution automatically support ad hoc analysis jobs from data scientists, which can be unpredictable in their arrival and substantial in their resource requirements?
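A toy sketch of the periodic scaling check an elastic service might run on the user’s behalf is shown below; the metrics, thresholds, and actions are illustrative assumptions.

# A toy sketch of the periodic scaling check an elastic HaaS might run;
# the metric names, thresholds, and step sizes are illustrative assumptions.
def scaling_decision(pending_containers, running_containers,
                     hdfs_used_gb, hdfs_capacity_gb):
    actions = []

    # Compute elasticity: grow when work is queued, shrink when mostly idle.
    if pending_containers > 0.2 * max(running_containers, 1):
        actions.append("add compute nodes")
    elif running_containers == 0 and pending_containers == 0:
        actions.append("release surplus compute nodes")

    # Storage elasticity: keep HDFS utilization inside a safe band, including
    # headroom for the intermediate output of running jobs.
    utilization = hdfs_used_gb / hdfs_capacity_gb
    if utilization > 0.75:
        actions.append("add storage nodes")
    elif utilization < 0.30:
        actions.append("retire storage nodes (after re-replicating blocks)")

    return actions or ["no change"]

print(scaling_decision(pending_containers=40, running_containers=100,
                       hdfs_used_gb=8200, hdfs_capacity_gb=10000))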

Criteria 5: Non-Stop Operations

Big data environments present more challenging operating conditions than one finds in non-parallel applications, including:

• The need to restart failed subprocesses of a large job so the entire job does not have to be rerun

• Jobs that starve for resources and finish late (or not at all), even when resources are available

• Deadlock, in which two processes each wait for a resource held by the other

Non-stop Hadoop operations address these and other problems unique to the Hadoop environment. In-house and RIY environments are especially prone to lapses in non-stop operation because maintaining it requires deep Hadoop expertise and tooling.

Criteria 6: Ecosystem Component Availability and Version Tracking

As big data adoption accelerates throughout the industry, so do the capabilities found within the Hadoop ecosystem. This is clearly seen in the development of in-memory analysis engines, low-latency SQL for Hadoop, machine learning libraries, and a growing number of scheduling, workflow, and data governance tools. These new capabilities often require features found only in the very latest releases of the core Hadoop components.

HaaS providers should constantly monitor developments in the ecosystem and be a trusted advisor to their customers on when, why, and how to adopt them. Further, HaaS providers should keep their environments up to date with the latest releases of Hadoop – and help their customers migrate along with them – so users can take advantage of the latest developments.

Criteria 7: Cost Transparency

Anticipating and managing costs, even in relatively simple IaaS environments, can be difficult. The problem is even more pronounced in complex and dynamic big data environments, where compute and storage costs are hard to understand. This is especially true when multiple variables, such as instance types and virtual machine sizes, must be considered.

To achieve reasonable levels of cost transparency, pricing models should ultimately tie to Hadoop units of work and capacity, specifically YARN job units and HDFS storage. Because these are fundamental units of work and storage, one can make direct comparisons across services that use such pricing models. HaaS providers that bill in terms of Hadoop units of work and storage capacity may ultimately be less expensive than other services, especially when considering total cost of ownership.
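The sketch below illustrates the kind of estimate such a pricing model makes possible. The rates are invented purely for illustration, and the units (YARN memory GB-hours and HDFS GB-months) are one plausible choice of Hadoop-native billing units.

# A toy cost estimate in Hadoop-native units (YARN compute and HDFS storage);
# the rates are hypothetical and exist only to illustrate the comparison.
def monthly_cost(yarn_gb_hours, hdfs_gb_months,
                 rate_per_gb_hour=0.002, rate_per_gb_month=0.05):
    compute = yarn_gb_hours * rate_per_gb_hour       # memory-hours consumed by YARN containers
    storage = hdfs_gb_months * rate_per_gb_month     # data kept at rest in HDFS
    return {"compute": compute, "storage": storage, "total": compute + storage}

# 200 jobs at roughly 500 GB-hours of YARN memory each, plus 20 TB at rest in HDFS.
print(monthly_cost(yarn_gb_hours=200 * 500, hdfs_gb_months=20000))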

Criteria 8: Holistic Job Tuning

Experienced big data practitioners continuously monitor jobs to collect information on how to tune workloads. The run time of complex jobs can be cut by as much as 50% through this iterative design-run-monitor cycle, which is especially important for repeatedly run jobs. Job tuning in Hadoop environments involves the entire service stack, as shown in Figure 1. Inefficient jobs can be caused by problems at any layer of the stack. For example, the application logic may use one of the elements in the application framework ineffectively, or the root cause of a performance problem may be a misconfiguration in the Hadoop/YARN layer.

Figure 1. The Hadoop service stack consists of multiple layers that all must be tuned for optimal performance.

Providing operational feedback for this iterative design-run-monitor cycle through a HaaS implementation can be delicate, but state-of-the-art monitoring developed with holistic job tuning in mind can bring these tuning efficiencies to the user.
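As one illustration, the sketch below applies the kind of automated counter checks a monitoring layer might run after each job. SPILLED_RECORDS and MAP_OUTPUT_RECORDS are standard MapReduce counters, while the derived reduce-time figures, thresholds, and advice are illustrative assumptions.

# A toy sketch of post-run counter checks; SPILLED_RECORDS and
# MAP_OUTPUT_RECORDS are standard MapReduce counters, while the derived
# reduce-time metrics, thresholds, and advice are illustrative.
def tuning_hints(metrics):
    hints = []

    # Heavy spilling suggests map-side sort buffers are too small for the workload.
    if metrics["SPILLED_RECORDS"] > 2 * metrics["MAP_OUTPUT_RECORDS"]:
        hints.append("map output is spilling repeatedly; revisit sort buffer sizing")

    # A long tail of slow reducers often points to key skew in the application logic.
    if metrics["slowest_reduce_secs"] > 3 * metrics["median_reduce_secs"]:
        hints.append("reduce-side skew detected; consider a better partitioning key")

    return hints or ["no obvious issues at this layer"]

print(tuning_hints({"SPILLED_RECORDS": 9000000, "MAP_OUTPUT_RECORDS": 3000000,
                    "slowest_reduce_secs": 1800, "median_reduce_secs": 400}))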

Criteria 9: Team Experience

The final criterion to consider when evaluating HaaS providers is the experience of the team behind the service. Building any large-scale Hadoop environment is a complex and difficult task. Maintaining such an environment in the face of widely varying workloads across multiple customers adds to the difficulty. Prior to committing to a HaaS provider, it is advisable to understand the experience and background of the support staff responsible for big data analysis and production jobs.

The clear benefits of Hadoop are balanced by substantial challenges in implementing, configuring, and maintaining a Hadoop environment. Various forms of HaaS have emerged to address these challenges. Use the nine criteria outlined in this article to select the offering that best meets your needs.

Raymie Stata is the CEO and co-founder of Altiscale. Formerly the CTO at Yahoo, Stata played an instrumental role in algorithmic search, display advertising, and cloud computing. He has also worked for Digital Equipment’s Systems Research Center, where he contributed to the AltaVista search engine. Raymie received his PhD in Computer Science from MIT in 1996.
