Containerized Spark Deployment Pays Dividends
Hadoop has emerged as a general purpose big data operating system that can perform a range of tasks and run all kinds of processing engines. But all that power and flexibility comes with a cost, which is something that one prominent healthcare analytics firm decided it didn’t want to pay anymore.
Despite its generic-sounding name, The Advisory Board Company has a very specific goal: help its clients in healthcare implement best practices to improve financial results and patient outcomes. To that end, the Washington, D.C.-based firm makes extensive use of the latest analytic technologies and techniques to help its clients, which include 90% of the country’s hospitals.
The company is in the process of migrating many of its analytic processes from relational database technologies to modern systems, including Apache Hadoop and Apache Spark, and the latest data science tools, such as RStudio, Apache Zeppelin, and H2O.
Its big data team is using these newer technologies as part of a major rewrite of the extract, transform, and load (ETL) programs that upload data from its clients’ computer systems on a monthly basis.
That ETL process is a critical first step that must be completed before Advisory Board clients can utilize its extensive collection of software as a service (SaaS) offerings. Once the data is in place, the firm can offer a range of analytic capabilities and metrics that help hospitals earn higher profits while getting better outcomes for patients, to boot.
The company decided to use the Hortonworks Data Platform (HDP) distribution of Hadoop to host the extensive collection of transformations and validations that compose the Advisory Board’s ETL library. Then it brought in the virtualization firm BlueData, which has been called the VMware of big data, to quickly spin Hadoop clusters up and down in support of its developers’ and engineers’ processing needs.
But Advisory Board soon discovered that the entire HDP suite was not required for the particular job. According to managing director Ramesh Thyagarajan, the company went through several iterations in search of the right technological tools and processing architecture to host these critical ETL processes.
Eventually, the company discovered Apache Spark was a good tool for this job, and that’s when it started questioning the need to run the entire Hadoop stack on its production Supermicro cluster, which consists of 10 nodes, each of which has 20 CPU cores and 1TB of RAM.
“We were running the entire Hortonworks Hadoop distribution in each one of the clusters,” Thyagarajan tells Datanami. “Once we went with Spark, we realized we didn’t need all the software to run these things, because it was eating up resources. We are not utilizing the software, so let’s change the whole architecture and then go with Spark containers.”
BlueData’s software, called EPIC, could rapidly spin up Spark resources using Hadoop’s resource scheduler, YARN. But at some point, Advisory Board ran into issues getting Spark master and worker nodes to communicate in a YARN cluster, Thyagarajan says, so the company looked at other resource schedulers.
“We tried several ways to figure out what is happening and we finally said, this is not working, so we went with Mesos,” Thyagarajan says. “Now we are using Hortonworks just as an HDFS base…Mesos is my scheduler and the resource manager, and BlueData provides me the clusters.”
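In broad strokes, that arrangement, with Mesos as the scheduler and HDFS kept on as the storage layer, can be sketched as a Spark job submission against a Mesos master. The hostnames, ports, paths, and resource settings below are hypothetical, not details from Advisory Board’s deployment:

```shell
# Submit a Spark ETL job to a Mesos master instead of YARN,
# reading input from and writing output to HDFS.
# All hostnames, paths, and sizes here are illustrative placeholders.
spark-submit \
  --master mesos://mesos-master.example.com:5050 \
  --conf spark.executor.memory=8g \
  --conf spark.cores.max=40 \
  etl_job.py \
  hdfs://namenode.example.com:8020/landing/client_feed \
  hdfs://namenode.example.com:8020/curated/client_feed
```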
Apache Spark provided a big performance boost for the ETL jobs. According to Thyagarajan, the new Spark-based ETL process has gone from taking 21 days to load a client’s data down to two hours on the same cluster. “So we’re able to achieve a great level of speed in processing,” he says.
The combination of the EPIC virtualization software and the Mesos scheduler lets Advisory Board give Spark containers all the processing power they need, and then quickly scale them back down to free up resources for other workloads. The company is also running other non-ETL workloads, such as analytics, using HDP and BI/ETL tools on BlueData, according to the vendor.
At peak load, when the hospitals are trying to upload data into Advisory Board’s cloud, there will be nearly 30 different containers simultaneously running on the 10-node cluster. “Once the job is done, the container is all gone, except for a few key feed containers that…we keep running all the time,” Thyagarajan says.
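This grow-at-peak, shrink-when-done behavior is the kind of thing Spark’s dynamic allocation feature is built for: executors are requested while a job is busy and released once it idles. A minimal, hypothetical configuration fragment (the specific values are illustrative, not Advisory Board’s settings) might look like:

```shell
# Hypothetical spark-defaults.conf fragment: dynamic allocation lets Spark
# acquire executors at peak load and release idle ones when a job winds down,
# freeing cluster resources for other workloads.
spark.dynamicAllocation.enabled               true
spark.shuffle.service.enabled                 true
spark.dynamicAllocation.minExecutors          1
spark.dynamicAllocation.maxExecutors          30
spark.dynamicAllocation.executorIdleTimeout   60s
```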
The data science tools are also benefiting from the slimmed-down big data setup. Thyagarajan says an R model that used to take upwards of 40 hours to complete can now finish within 40 minutes under Spark on Mesos.
Without the combination of BlueData and Mesos, Thyagarajan estimates Advisory Board would need at least a 60-node cluster to handle the work. That’s six times bigger than what the company is currently using. “We have seen in every aspect of data management, what we’re doing is really paying dividends.”
It’s all about optimizing the cluster resources, says Jason Schroedl, vice president of marketing for BlueData.
“By using containers, we allow you to break from the traditional Hadoop deployment,” he says. “And as you add in new capabilities, new tools out in the open source ecosystem, and new versions of tools continually coming out, we allow our customers to spin up those clusters running in containers without the more rigid architecture that typically comes with a traditional bare-metal Hadoop deployment.”