March 12, 2014

Hadoop Alternative Hydra Re-Spawns as Open Source

Alex Woodie

It may not have the name recognition or momentum of Hadoop. But Hydra, the distributed task processing system first developed six years ago by the social bookmarking service maker AddThis, is now available under an open source Apache license, just like Hadoop. And according to Hydra’s creator, the multi-headed platform is very good at some big data tasks that the yellow pachyderm struggles with, namely real-time processing of very big data sets.

Hydra is a big data storage and processing platform developed by Matt Abrams and his colleagues at AddThis (formerly Clearspring), the company that develops the Web server widgets that allow visitors to easily share something via their Twitter, Facebook, Pinterest, Google+, or Instagram accounts.

When AddThis started scaling up its business in the mid-2000s, it got flooded with data about what users were sharing. The company needed a scalable, distributed system that could deliver real-time analysis of that data to its customers. Hadoop wasn’t a feasible option at that time. So it built Hydra instead.

So, what is Hydra? In short, it’s a distributed task processing system that supports streaming and batch operations. It uses a tree-based data structure to store and process data across clusters with thousands of individual nodes. It runs on native Linux file systems, which makes it compatible with ext3, ext4, or even ZFS. It also features a job/cluster management component that automatically allocates new jobs to the cluster and rebalances existing jobs. The system also replicates data and handles node failures automatically.

The tree-based structure allows it to handle streaming and batch jobs at the same time. In his January 23 blog post announcing that Hydra is now open source, Chris Burroughs, a member of AddThis’ engineering department, provided this useful description of Hydra: “It ingests streams of data (think log files) and builds trees that are aggregates, summaries, or transformations of the data. These trees can be used by humans to explore (tiny queries), as part of a machine learning pipeline (big queries), or to support live consoles on websites (lots of queries).”

Hydra was originally developed to help AddThis answer questions about its data, for internal use as well as a service to website operators. Examples of typical questions include “How many unique visitors did I have last month?” and “How many page views did we get from [enter country, Web browser, etc.]?”
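To make the idea of building trees from a stream a little more concrete, here is a minimal Python sketch of a tree of running aggregates that can answer questions like the ones above. It is an illustration only and does not use Hydra's actual API, job configuration, or storage format; the names TreeNode, ingest, and query, the event fields, and the sample events are all hypothetical, and the exact set used for unique visitors is a simplification chosen for readability.

```python
from collections import defaultdict


class TreeNode:
    """One node in a hypothetical aggregation tree (not Hydra's real node type)."""

    def __init__(self):
        self.hits = 0                          # page views seen under this node
        self.visitors = set()                  # exact set of visitor ids (simplification)
        self.children = defaultdict(TreeNode)  # child nodes keyed by field value


def ingest(root, event, path_fields):
    """Fold one event into the tree, updating every node along its path."""
    node = root
    node.hits += 1
    node.visitors.add(event["visitor"])
    for field in path_fields:
        node = node.children[event[field]]
        node.hits += 1
        node.visitors.add(event["visitor"])


def query(root, *path):
    """Return the page-view count stored at the given path, or 0 if never seen."""
    node = root
    for key in path:
        if key not in node.children:  # membership test does not create a child
            return 0
        node = node.children[key]
    return node.hits


# Made-up page-view events, standing in for parsed log lines.
events = [
    {"country": "US", "browser": "Chrome", "visitor": "a"},
    {"country": "US", "browser": "Firefox", "visitor": "b"},
    {"country": "DE", "browser": "Chrome", "visitor": "a"},
    {"country": "US", "browser": "Chrome", "visitor": "c"},
]

root = TreeNode()
for event in events:  # incremental: the tree is updated as each event arrives
    ingest(root, event, ["country", "browser"])

print(query(root, "US"))            # page views from the US
print(query(root, "US", "Chrome"))  # page views from the US on Chrome
print(len(root.visitors))           # unique visitors overall
```

Because every event updates the counters on its path as it arrives, a query only has to walk a few nodes rather than rescan the raw log, which is the general appeal of keeping pre-aggregated trees for stream analysis.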

AddThis continues to use Hydra to process its massive data flow and tell its customers about Website trends. AddThis is in a unique position to see what people are sharing online and what topics are hot. The social bookmarking service is installed on more than 13 million web domains and reaches 1.3 billion unique users a month, throwing off 10TB of data from an average of 3 billion page views per day. At any one time, Hydra is running on a thousand nodes at AddThis.

“We’ve been dealing with very large data sets for a long time,” Abrams tells Datanami via email. “Hydra has been tremendously useful for us and we feel it solves the distributed data processing problem in a unique way.”

Whereas traditional Hadoop is batch-oriented, Hydra can support both batch and real-time streaming operations. “While Hydra supports batch processing it is primarily focused on stream analysis and incremental data processing,” Abrams says. “The ability to represent data in tree data structures provides natural data compression and efficient query access. Hydra can produce and consume data from HDFS but it operates on native file systems which makes it easy to colocate Hydra with other services.”

Now that Hydra is open source, Abrams hopes that the software will see more widespread use and will generate a bigger following. “It will take some time but we believe that we can build a community around Hydra. Then AddThis and the OS [open source] community can benefit from future enhancements. In the [Washington] DC area there is already some production use of Hydra outside of AddThis, so we are excited to see the community grow.”

Last fall, Doug Cutting, the creator of Hadoop and chief architect at Cloudera, lamented the lack of alternatives to Hadoop. “I expected there to be multiple systems like Hadoop….and really nothing else has emerged,” Cutting said at the time. While Hadoop certainly dominates big data discussions today, who’s to say that it will be the only big data platform for distributed computing? Perhaps there’s room for additional heads in this discussion.

Hydra can be downloaded from its GitHub page.

Related Items:

Rethinking Real-Time Hadoop

When to Hadoop, and When Not To

OLTP Clearly in Hadoop’s Future, Cutting Says
