A Tale of Two Hadoop Journeys
Hadoop brings different things to different companies. For some, the Hadoop platform provides a great starting point to begin analyzing large data sets. But for established companies, Hadoop often displaces existing investments in data warehousing and business intelligence tools.
The Hadoop implementations at OPower, an electricity usage analysis firm, and Edmunds.com, the automobile website, are both cases of the latter. That is, Hadoop allowed the companies to do things that they couldn’t achieve with their existing tools.
In the case of OPower, a Hadoop cluster gave it the computational horsepower that enabled it to grow its business. “We were in a situation where we were getting more and more clients, which is a great business problem to have,” Drew Hylbert, vice president of technology and infrastructure at OPower, said at the recent Hadoop Summit 2013 conference. “But doing all the processing that we needed to do on MySQL, which we were using, was not going to allow us to generate the amount of insights we needed to generate, both in terms of latency and coverage.”
In particular, MySQL was not up to the task of delivering a new product, called bill forecasting, to OPower’s electric utility customers. According to Hylbert, the company needed to bring together large amounts of disparate data–including things like weather data and usage data–in order to forecast the size of the electricity bills for Pacific Gas & Electric customers. “We just had no way to produce a daily bill forecast in the SLA that PG&E wanted, unless we moved to a distributed computing model,” he said.
Hylbert, who previously was a MapReduce engineer at Yahoo, realized that Hadoop would be a good fit, in part because OPower had a large contingent of Java developers. After building a Hadoop cluster using the open source Apache distribution, the company was ready to take the next step and move it into production. That’s when Hylbert began shopping for a commercial Hadoop product.
“We did go through the exercise of, how we were going to use this platform, and what distribution had better coverage of features than open source, and whether [to use] HBase or Hive, and which distributions could support our use case,” Hylbert said. “We knew from the start that we … needed to use HBase, because we didn’t want to just process all the time-series data in batch. We wanted to use it as a source of record, which meant we would need to do real time low latency lookups of individual values in the store.”
OPower’s bill forecasting product went live in 2012 on Cloudera’s distribution, which the company chose largely for its HBase support, according to Hylbert. Bill forecasting has helped OPower with its objective of reducing electricity consumption. The company has also implemented a second Hadoop cluster to solve another problem: proving that its efforts to alter electricity consumption patterns actually work.
OPower had used traditional sampling methodologies to try to prove the worth of its programs. Some of the sample sizes were quite large–50,000 homes or businesses. But even samples of this size were insufficient. “We didn’t have a way where we could prove that our programs were actually reducing peak [loads] until we put another Hadoop cluster,” he said. The new cluster was more of a traditional data warehousing cluster that included existing reports, usage data, communications with customers, and weather data. “We were actually able to prove that the OPower program reduces peak demand,” he said.
The data warehousing team at Edmunds.com had similar challenges. According to Paddy Hannon, the company’s vice president of architecture, its existing business intelligence infrastructure was unable to keep up with the volume of data being generated.
“Our journey started with a lot of pain,” Hannon said during the Hadoop Summit conference. “We were a traditional data warehousing group using Oracle, Informatica, etc. We were using all the basic tools and we were getting more and more data.”
As one of the Internet’s top automobile websites, Edmunds.com acts as a clearinghouse of pricing and inventory data for new and used cars around the country. Getting all the car-related information, including recent car sales and other transactions, into the data warehouse was taking more than eight hours every night.
“If something failed you had to restart that process, and pretty soon, you run into a point where you can’t possibly catch up,” Hannon said. The data warehouse was missing weeks’ worth of data, “which we used to recognize revenues, so that’s sort of a problem,” he said.
The other problem afflicting Edmunds.com was the slowness of the traditional way of generating and delivering more than 300 reports to business analysts. The whole process of obtaining requirements, designing the data mart, collecting the data, building the reports, and then generating the actual reports was a three- to six-month ordeal at Edmunds.com.
“I wanted that to go a lot faster,” Hannon said. “I was used to the Web development world, where we have three week iterations–every three weeks, we had a new Web page and could see how it worked. I didn’t see why we couldn’t do that in a data warehouse environment.”
Hannon drove the adoption of Hadoop at Edmunds.com to address these two issues. Because Edmunds.com already had a lot of Java development talent, going with Hadoop was a “no brainer,” Hannon said. The company also selected Cloudera’s distribution, due largely to its support for HBase.
However, as head of the data warehousing team, Hannon made the decision not to bring in outside data warehousing experts, but instead to leverage the company’s existing Java engineers. That made for a slow start, he said. “We kind of floundered for a while,” he said. “We weren’t quite sure how to get started. We were bringing data in, but it didn’t feel like we had a lot of traction.”
The big breakthrough came when Hannon’s team started building a new paid marketing system on Hadoop. “It allows us to do something we couldn’t do in our Oracle warehouse because we couldn’t crunch enough data,” he said. “We were able to take our clickstream data, our ad revenue data, and our paid marketing data and show how much we were spending to acquire visitors on a per-click basis, essentially… From there on out, it stopped being a battle to convince people that Hadoop was the way to go.”
That is not to say that Hadoop is the end-all, be-all for data analytics at Edmunds.com, which continues to use other tools, including MongoDB, Amazon Redshift, MicroStrategy, Netezza, Postgres, and Platfora. But when it comes to data warehousing, Hadoop has given the company more agility.