Anatomy of a Hadoop Project Failure
Several years ago, the educational technology company Blackboard selected Apache Hadoop to run a new data analytics application designed to turn data exhaust into actionable insight. Months later, the failed project was cancelled, and Blackboard implemented a hosted relational data warehousing product instead.
The reasons behind Blackboard‘s initial selection of Hadoop for this project will sound familiar: a desire to maximize data exhaust, a need to bring large amounts of data together for analysis, and a curiosity to work with emerging technology.
But the factors leading to the Hadoop failure will also ring a bell to those experienced with Hadoop projects: difficulty integrating opens source pieces, complex architectures and data flows, and an inability to read data from Hadoop in a useful and timely fashion.
Maximize Data Exhaust
On paper, Blackboard would seem to be a good candidate for Hadoop. As the leading provider of learning management systems for the educational sector, the Washington D.C.-based company is well-versed in technology. The company counts 16,000 direct clients and touches more than 100 million students, who interact with educators and other students via its Web-based products.
“We have access to the largest data set in higher education in the world,” says Jason White, Blackboard’s director of product development, who led the Hadoop project. “We attempted a couple of years ago to build out a data lake to support not only the data science research but also some internal telemetry analysis of product usage. We attempted to build that out in a Hadoop stack.”
The main reasons for choosing Hadoop were scale, a desire to unify data, and technical curiosity on the part of Blackboard engineers, White says. The company had lots of familiarity with Microsoft SQL Server stack, and ran a data warehouse based on SQL Server Analytical Services. But it wanted to expand beyond what SQL Server could provide.
“Nobody inside of Blackboard had attempted to consolidate the vast amount of telemetry data and application log file exhaust that we had in the organization,” White told Datanami. “Nobody had attempted to put that in one clearinghouse type of place. At the time, looking at Hadoop, that stack seemed the right direction to go.”
Most of the data Blackboard was working with was in the JSON format. After adopting Hadoop, Blackboard switched to Avro, another semi-structured data format that’s at home in Hadoop. Everything seemed to be on track during the initial data ingestion phase. But problems soon surfaced when it came time to read data from Hadoop.
Trouble in the Reads
Getting into Hadoop, or writing data to the lake, has traditionally been easy. But getting the data back out, or reading it, has been a recurring problem expressed by many users.
“Reporting directly out of Hadoop, we were never able to find the right combination of tools,” White said. “We ended up building a series of very complex processes to, on a regular basis, push that data out to a relational database stack of back-end BI tools.”
Blackboard was moving JSON data exhaust from a relational database into Hadoop and then back out into a Postgres database, where it could be analyze using SQL. But that convoluted workflow didn’t sit right with White.
“Architecturally, it always just felt cumbersome to have all this data being sourced from relational database, pushed into a Hadoop stack, and then sharded back out into relational databases. It was all just an entirely complicated architectural diagram with very heavy DevOps lift.”
When Blackboard started working on the Hadoop project, it had a team of 10 engineers, with two of them dedicated to wrangling the various open source tools, patching bugs, and building work-arounds. That was more than White expected.
“The DevOps lift was heavier than we anticipated,” White explains. “We were not using a packaged distribution of Hadoop. We were attempting to roll our own and find all the right farm animals the suites of various Hadoop environment-related products to accomplish all the things we needed.”
Blackboard eventually decided to shut down the Hadoop cluster when it became clear that the approach they were taking was not cost effective. When Blackboard finally shut down the cluster, it had about 50TB of data, he estimated.
“Once we realized the burden that we have gotten ourselves into in terms of the care and feeding required to keep that environment up and running we quickly started looking for another solution,” he said.
Don’t Believe the Hype
Did Blackboard get suckered into the hype around Hadoop?
“I definitely think that’s a big part of it,” he answered. “I had three or four engineers on that team that are top-notch engineers who were really excited to go play with Hadoop because they had heard about it and wanted to check it out. That’s really what carried us so far down the road before coming to the realization that this wasn’t the right application for what we were trying to accomplish.”
White acknowledged that his decision to go the plain vanilla Apache Hadoop route may have played a role in the difficulty Blackboard experienced, and that choosing one of the packaged Hadoop distributions may have alleviated some of the burden.
Nonetheless, Blackboard looked elsewhere, including Redshift warehouse hosted on Amazon Web Services and an on-premise Vertica cluster. Eventually the company adopted a hosted SQL data warehouse from Snowflake Computing.
Back to SQL
Snowflake handles all of Blackboard’s needs in terms of powering the core BI reporting from the learning management systems, as well as providing a place for Blackboard’s data scientist to explore the exhaust data and power predictive tools, White said. Like many companies, Blackboard has many employees who are very experienced in using SQL, and they took to Snowflake very quickly.
But Snowflake is also serving Blackboard’s needs for non-SQL workloads, such as developing predictive machine learning algorithms in R and Python.
“We’re in the process of capturing as much application exhaust from all the various products in Blackboard suite into the data lake for that kind of analysis,” White said. “In addition to that, we’re using Snowflake to do the data science research. Those guys love it because of the capabilities to use R natively.”
Many of Blackboard’s data scientists develop machine learning algorithms in Python outside of Snowflake, and White said Snowflake’s environment makes it easy to implement those algorithms for predictive purposes. “It all works really well for them,” he said.
Perhaps most importantly, no Blackboard engineers are involved with maintaining the data lake in Snowflake, which today holds about 100TB and will soon be expanded to more than 1 petabyte, White said.
“Today we’ve got probably 40 people doing something related to Snowflake, and we don’t have a single person dedicate to any kind of operations,” he said. “The biggest thing for us was our ability to rapidly innovate with Snowflake and really be agile with what we’re trying to do. To me that’s where the real cost-savings comes in.”
White acknowledged that many other organizations have been very successful with Hadoop which he called “a powerful and important set of technologies.” However, he warned others not to make the mistake in thinking that because Hadoop is open source that it’s free. Being successful with Hadoop takes a big commitment.
“My advice would be to not underestimate the DevOps lift required to make a big data project based on Hadoop successful,” he said. “If you’re willing to invest in that, I think you can do great things with Hadoop. I’m not against it as a technology. It’s just it required a significant investment. It’s free like a puppy, not free like a beer.”