Hadoop Alternative Hydra Re-Spawns as Open Source
It may not have the name recognition or momentum of Hadoop. But Hydra, the distributed task processing system first developed six years ago by the social bookmarking service maker AddThis, is now available under an open source Apache license, just like Hadoop. And according to Hydra’s creator, the multi-headed platform is very good at some big data tasks that the yellow pachyderm struggles with, namely real-time processing of very big data sets.
Hydra is a big data storage and processing platform developed by Matt Abrams and his colleagues at AddThis (formerly Clearspring), the company that develops the Web server widgets that allow visitors to easily share something via their Twitter, Facebook, Pinterest, Google+, or Instagram accounts.
When AddThis started scaling up its business in the mid-2000s, it got flooded with data about what users were sharing. The company needed a scalable, distributed system that could deliver real-time analysis of that data to its customers. Hadoop wasn’t a feasible option at that time. So it built Hydra instead.
So, what is Hydra? In short, it’s a distributed task processing system that supports streaming and batch operations. It uses a tree-based data structure to store and process data across clusters with thousands of individual nodes. It runs on native Linux file systems, which makes it compatible with ext3, ext4, or even ZFS. It also features a job/cluster management component that automatically allocates new jobs to the cluster and rebalances existing jobs. The system also replicates data and handles node failures automatically.
The tree-based structure allows it to handle streaming and batch jobs at the same time. In his January 23 blog post announcing that Hydra is now open source, Chris Burroughs, a member of AddThis’ engineering department, provided this useful description of Hydra: “It ingests streams of data (think log files) and builds trees that are aggregates, summaries, or transformations of the data. These trees can be used by humans to explore (tiny queries), as part of a machine learning pipeline (big queries), or to support live consoles on websites (lots of queries).”
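To make Burroughs’ description concrete, here is a minimal Python sketch of the general idea: a stream of log-like events is folded into a tree of counters, which can then be queried at any depth. This is an illustration only, not Hydra’s actual code or API. The `Node` class, the `ingest` and `query` functions, and the sample events are all hypothetical.

```python
from collections import defaultdict


class Node:
    """One node in the aggregate tree: a hit counter plus named children."""

    def __init__(self):
        self.count = 0
        self.children = defaultdict(Node)


def ingest(root, event, path_fields):
    """Walk (creating as needed) one path through the tree, bumping each counter."""
    node = root
    node.count += 1
    for field in path_fields:
        node = node.children[event[field]]
        node.count += 1


def query(root, *path):
    """Return the counter stored at the given path, or 0 if the path is absent."""
    node = root
    for key in path:
        if key not in node.children:
            return 0
        node = node.children[key]
    return node.count


# Hypothetical log events; a real stream would arrive continuously.
events = [
    {"date": "2014-01-22", "country": "US", "browser": "Chrome"},
    {"date": "2014-01-22", "country": "US", "browser": "Firefox"},
    {"date": "2014-01-22", "country": "DE", "browser": "Chrome"},
    {"date": "2014-01-23", "country": "US", "browser": "Chrome"},
]

root = Node()
for event in events:
    ingest(root, event, ["date", "country", "browser"])

print(query(root, "2014-01-22"))                   # 3
print(query(root, "2014-01-22", "US"))             # 2
print(query(root, "2014-01-22", "US", "Chrome"))   # 1
print(query(root, "2014-01-24"))                   # 0
```

Because each event only touches the counters along one path, the tree can be updated incrementally as data streams in, and events that share a prefix (same date, same country) share nodes rather than being stored separately.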
Hydra was originally developed to help AddThis answer questions about its data, for internal use as well as a service to website operators. Examples of typical questions include “How many unique visitors did I have last month?” and “How many page views did we get from [enter country, Web browser, etc.]?”
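As a rough illustration of what answering those two questions involves, the Python sketch below counts unique visitors and page views per country from a handful of page-view records. It is not how Hydra computes these figures. The `summarize` function, the record fields, and the sample data are hypothetical, and the exact in-memory set used here is an assumption that would not hold up at billions of page views, where approximate counting is the usual approach.

```python
from collections import Counter


def summarize(page_views):
    """Return (number of unique visitors, page views per country)."""
    visitors = set()              # exact distinct count; fine for small inputs only
    views_by_country = Counter()
    for view in page_views:
        visitors.add(view["visitor_id"])
        views_by_country[view["country"]] += 1
    return len(visitors), views_by_country


# Hypothetical page-view records for one month.
page_views = [
    {"visitor_id": "a", "country": "US"},
    {"visitor_id": "b", "country": "US"},
    {"visitor_id": "a", "country": "DE"},
    {"visitor_id": "c", "country": "US"},
]

unique_visitors, views_by_country = summarize(page_views)
print(unique_visitors)          # 3
print(views_by_country["US"])   # 3
print(views_by_country["DE"])   # 1
```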
AddThis continues to use Hydra to process its massive data flow and tell its customers about Website trends. AddThis is in a unique position to see what people are sharing online and what topics are hot. The social bookmarking service is installed on more than 13 million web domains and reaches 1.3 billion unique users a month, throwing off 10TB of data from an average of 3 billion page views per day. At any one time, Hydra is running on a thousand nodes at AddThis.
“We’ve been dealing with very large data sets for a long time,” Abrams tells Datanami via email. “Hydra has been tremendously useful for us and we feel it solves the distributed data processing problem in a unique way.”
Whereas traditional Hadoop is batch-oriented, Hydra can support both batch and real-time streaming operations. “While Hydra supports batch processing it is primarily focused on stream analysis and incremental data processing,” Abrams says. “The ability to represent data in tree data structures provides natural data compression and efficient query access. Hydra can produce and consume data from HDFS but it operates on native file systems which makes it easy to colocate Hydra with other services.”
Now that Hydra is open source, Abrams hopes that the software will see more widespread use and will generate a bigger following. “It will take some time but we believe that we can build a community around Hydra. Then AddThis and the OS [open source] community can benefit from future enhancements. In the [Washington] DC area there is already some production use of Hydra outside of AddThis, so we are excited to see the community grow.”
Last fall, Doug Cutting, the creator of Hadoop and chief architect at Cloudera, lamented the lack of alternatives to Hadoop. “I expected there to be multiple systems like Hadoop … and really nothing else has emerged,” Cutting said at the time. While Hadoop certainly dominates big data discussions today, who’s to say that it will be the only big data platform for distributed computing? Perhaps there’s room for additional heads in this discussion.
Hydra can be downloaded from its GitHub page.