Solving Hadoop Problems, For Fun and Profit
Things move quickly in the Hadoop world, and keeping up can be hard to do. Just ask Chris Wensel, the creator of the popular open source development tool Cascading and CTO at Concurrent. While Wensel spends many hours keeping Cascading current with every Hadoop release as a service to the community, he’s got bigger fish to fry solving production Hadoop problems in enterprise accounts.
“I spend a lot of CPU cycles and dollars on Amazon testing Cascading on every vendor distribution and many, many Hadoop releases to make sure that everything is kosher,” Wensel told Datanami in an interview at the recent Hadoop Summit. “Because I have users running in Hadoop 20.205. I have a massive number of users still running on Hadoop 1…People aren’t migrating.”
Giving Hadoop users the option to migrate, whether from Hadoop 1 to Hadoop 2, from MapReduce to Tez, or from Cloudera’s distro to Hortonworks’ or vice versa, is one of the key benefits that people get when they adopt Cascading. The tool essentially presents a layer of abstraction between the application developer and Hadoop. So instead of writing to the Hadoop APIs, which are constantly changing, developers can write to Cascading’s APIs (available in Java, Python, and Scala flavors, among others) and deploy the application using whichever Hadoop data fabric meets their needs.
By all accounts, Cascading is one of the most popular tools in the budding Hadoop ecosystem. According to Concurrent CEO Gary Nakamura, it’s being downloaded 275,000 times per month, and is used in about 10,000 Hadoop clusters, which is a mind-boggling number when you consider that Cloudera, Hortonworks, and MapR together account for only 2,000 production Hadoop clusters. (Hint: Amazon is the unseen giant in the Hadoop world.)
To Nakamura’s way of thinking, Cascading will be instrumental in fueling the application-building boom that will characterize the next phase of Hadoop adoption. “This next phase is going to be about building applications,” Nakamura says. “It’s going to be very important for enterprise to pick a platform that doesn’t change out from underneath their feet every 90 days. Hadoop has been that fluid for the last couple years…”
The latest release of Cascading, version 3.0, delivers full support for Apache Tez, which is one of the Hadoop-resident data fabrics that will replace the old MapReduce engine that characterized first-gen Hadoop clusters. The company plans to unveil support for Apache Spark later this year with another update to Cascading’s query planner, giving users even more options for how and where to run their Hadoop jobs.
While Cascading helps shield developers from the complexity of Hadoop development, the product doesn’t address other elements of the Hadoop experience that are generating complexity—namely the monitoring of production clusters and the data-driven applications that run atop them. That’s where Concurrent’s other tool, called Driven, comes into play.
According to Wensel, Driven automatically monitors the performance of Hadoop clusters and data-driven workflow applications, and helps pinpoint problems that often crop up in distributed systems, such as low parallel efficiency, bottlenecks in the reducers, jobs running in the wrong queue, overprovisioning on limited hardware, or users running their jobs at a higher priority than allowed.
“You can boot up 1,000 reducers, but if only one reducer is seeing data, you just locked up your whole cluster and it’s going to take forever to get anything done,” Wensel says. “That’s a common problem. Does Hadoop tell you that’s happening? No, not really. You have to go figure that out post mortem.”
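The lopsided-reducer scenario Wensel describes can be put in numbers. The following is a minimal sketch, not anything taken from Driven or Hadoop: the function name `reducer_skew` and the record counts are hypothetical, and it assumes per-reducer record counts have already been collected from job counters by some other means.

```python
def reducer_skew(records_per_reducer):
    """Return (busy_fraction, max_share) for a list of per-reducer record counts.

    busy_fraction: share of reducers that received at least one record.
    max_share: share of all records handled by the single busiest reducer.
    """
    total = sum(records_per_reducer)
    if total == 0:
        return 0.0, 0.0
    busy = sum(1 for n in records_per_reducer if n > 0)
    return busy / len(records_per_reducer), max(records_per_reducer) / total


# Hypothetical job: 1,000 reducers launched, but every record lands on one of them.
counts = [0] * 1000
counts[17] = 5_000_000

busy_fraction, max_share = reducer_skew(counts)
print(f"reducers doing work: {busy_fraction:.1%}")
print(f"share of data on busiest reducer: {max_share:.0%}")
```

A busy fraction near zero combined with a maximum share near 100 percent is the signature of the problem in the quote: the cluster is fully reserved while a single task does all the work.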
Driven essentially works by building a directed acyclic graph (DAG) that tracks the operators involved and the inputs and outputs of the application. The first release of Driven was designed to pull in metadata from applications built using Cascading, but it will support more development environments in the future with an open API.
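For readers unfamiliar with the idea, a workflow DAG can be sketched in a few lines. This is an illustration only and says nothing about how Driven is implemented: the step names are invented, and the graph is modeled with Python's standard-library `graphlib` module (Python 3.9 or later), where each step maps to the set of steps it reads from.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each step maps to the steps it depends on.
dependencies = {
    "parse_logs": {"raw_logs"},
    "group_by_user": {"parse_logs"},
    "count_visits": {"group_by_user"},
    "report": {"count_visits"},
}

# A valid execution order; TopologicalSorter raises CycleError if the graph has a cycle.
order = list(TopologicalSorter(dependencies).static_order())

# Inputs are steps with no upstream dependency; outputs are steps nothing else reads.
inputs = [n for n in order if not dependencies.get(n)]
outputs = [n for n in order if all(n not in preds for preds in dependencies.values())]

print("execution order:", order)
print("inputs:", inputs)
print("outputs:", outputs)
```

With the graph in hand, a monitoring tool can attach timings and record counts to each node, which is what makes it possible to say which operator in a long workflow is the slow one.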
“We bring it up a level,” Wensel says. “Usually it’s figuring out whether [the problem] is code, is it configuration, is it the shape of the data, or the lack of resources.”
The enterprise IT world may have lots of mature performance monitoring tools to pick from, but the equivalent tools for Hadoop are still largely green. “We’ve watched this happen a couple of times. Look at the J2EE stack and all the things that happened there,” Wensel says. “We need to do the same things for this class of technology, but we need to adopt the philosophies behind Hadoop and the open source community, the way they’re doing things. Let the bureaucracy and the organization get out of your way and let you write applications as productively as possible, thus Cascading. And then have something to tell you how you did it,” which is where Driven comes into play.
While Cascading is a free and open source product, Driven is neither free nor open source. Wensel is counting on Driven being adopted by enterprises to help pay the bills for supporting Cascading. “Driven is the money maker,” he says. “Cascading–if it made money I wouldn’t have bothered with Driven. I wouldn’t be here–I’d be in the mountains fly-fishing.”
For someone who contributes so much to the open source community, Wensel is rather outspoken about the effect that open source is having on jobs. “We’re destroying our middle-class by writing open source at the end of the day,” he says. “So you got to find a way to do what you love, build the open source, but you have to figure out a way to monetize it in some way.”
If Driven is 10 percent as successful as Cascading, Wensel won’t have to worry much about the monetization issue. By all accounts, it appears that Driven fills a spot in the Hadoop stack that’s not being adequately addressed in a simple and cohesive manner by the various projects at the Apache Software Foundation. Time will tell if it resonates in the market. And in the meantime, hopefully the regression testing for Cascading doesn’t eat too far into Wensel’s mountain time.