April 11, 2013

Hadoop Data Management Set to Fly with Falcon

Isaac Lopez

Hadoop data management tools for the enterprise are on their way, says a team of open source developers at Hortonworks and InMobi, whose project, dubbed Falcon, was recently accepted as an Apache Software Foundation incubator project.

“There are two sets of problems [that Falcon addresses],” explained Hortonworks CTO Eric Baldeschwieler in a recent keynote at the Hadoop Summit in Amsterdam. “One is data life cycle and data movement. How do you get data into the cluster – how do you move it between clusters and make sure that you keep the data in the right place for the right amount of time. The other is how do you automate ETL flows in a much simpler, more declarative fashion.”

According to the Apache page outlining the Falcon proposal, enterprises will be able to set up Falcon with relative ease, using declarative mechanisms to define infrastructure endpoints, data sets and processing rules. With the dependencies between these configured entities explicitly defined, Falcon will then orchestrate data management functions automatically.
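The idea of declaring entities and letting the framework work out the order of operations can be illustrated with a minimal Python sketch. This is not Falcon's API or configuration format: the `Entity` class, the `orchestration_order` function, and the entity names are all hypothetical, chosen only to show how explicitly declared dependencies between endpoints, data sets and processing rules can be resolved automatically.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter  # standard library, Python 3.9+


@dataclass(frozen=True)
class Entity:
    """Hypothetical declarative entity; not Falcon's actual schema."""
    name: str
    kind: str  # "cluster" (endpoint), "feed" (data set) or "process" (rule)
    depends_on: tuple = ()


def orchestration_order(entities):
    """Return entity names ordered so each one follows its dependencies."""
    known = {e.name for e in entities}
    graph = {}
    for e in entities:
        missing = [d for d in e.depends_on if d not in known]
        if missing:
            raise ValueError(f"{e.name} depends on undefined entities: {missing}")
        graph[e.name] = set(e.depends_on)
    # TopologicalSorter takes a mapping of node -> predecessors.
    return list(TopologicalSorter(graph).static_order())


# Entities may be declared in any order; the dependencies decide the schedule.
entities = [
    Entity("clean-clicks", "process", depends_on=("raw-clicks", "primary-cluster")),
    Entity("raw-clicks", "feed", depends_on=("primary-cluster",)),
    Entity("primary-cluster", "cluster"),
]
order = orchestration_order(entities)
print(order)  # ['primary-cluster', 'raw-clicks', 'clean-clicks']
```

The point of the sketch is that the user only states what exists and what depends on what; the ordering, and the check for undefined dependencies, fall out of the declarations rather than being coded by hand.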

“If you look at where Hadoop is in its adoption lifecycle, these needs are starting to really emerge this year,” says Shaun Connolly, VP of Corporate Strategy at Hortonworks, pointing to a growing need for this type of data management. “Increasingly over the last 6 to 9 months, we’ve seen an increase in more mainstream enterprises that have been embracing Hadoop for various needs. Now that they have a Hadoop cluster or two running, they’re going to double back and basically say ‘now how do I operationalize this.’”

Currently, early adopters of Hadoop handle these processes in disparate ways, with IT teams coding them by hand, explained Connolly, a process that can be tedious and prone to error. He says that once they recognized this gap, they moved to plug it with the open source Falcon solution, which Connolly says automates the processing of data lifecycle management scenarios in predictable and reliable ways.

“What Falcon does is provide a framework for addressing [data lifecycle management] needs within the context of Apache Hadoop, but it also provides a set of open APIs that enable those workflows to be orchestrated more broadly, so if you want to orchestrate data lifecycle workflows within Hadoop as well as with your Teradata system (as an example) concurrently, then enterprises would use the Falcon API from those other tools and be able to drive those workflows indirectly.”

A recent post on the Hortonworks website by Falcon contributor Venkatesh Seetharam illustrates Falcon’s role as a data management tool for Hadoop.

While the Falcon project was only recently added as an Apache Software Foundation incubator project, the code itself is now beginning its second year of maturity, having been developed by mobile ad network company InMobi.

InMobi built the Falcon framework to scratch its own internal data management itch. According to the mobile-oriented ad platform developer, its network receives in excess of 10 billion events (ad-serving and related) every day through multiple sources and streams originating from over ten geographically distributed data centers, requiring the processing of tens of terabytes of data a day.

“As we explored cheaper and more effective ways of processing this huge amount of data, we came up with a simple in-house scheduler to manage job flows in our environment then,” explains Mohit Saxena at InMobi. “We realized that to be able to process data in a decentralized fashion, we needed to have the complexity pushed into a platform and allow the engineers to focus on the processing / business logic.” 

After developing the framework, InMobi engineers approached Hortonworks engineers about working together to bring it to Apache for incubation and acceleration. According to Saxena, Falcon has been widely used over the last year for various processing pipelines and data management functions, including SLA-critical feedback pipelines, correctness-critical revenue pipelines, and other applications, getting a heavy workout prior to its Apache incubation.

While Connolly was reluctant to give us an expected launch time frame, he did hint that he did not expect a prolonged wait.

“The net out of it is that it’s been deployed in production at InMobi for about a year, so there’s some significant technology there, and the goal is to accelerate adoption.”
