Yahoo is in the process of implementing a big data tool called Druid to power high-speed real-time queries against its massive Hadoop-based data lake. Engineers at the Web giant say the open source database’s combination of speed and usability on fast-moving data makes it ideal for the job.
Druid is a column-oriented in-memory OLAP data store that was originally developed more than four years ago by the folks at Metamarkets, a developer of programmatic advertising solutions. The company was struggling to keep the Web-based analytic consoles it provides customers fed with the latest clickstream data using relational tools like Greenplum and NoSQL databases like HBase, so it developed its own distributed database instead.
The core design parameter for Druid was being able to compute drill-downs and roll-ups over a large set of “high dimensional” data comprising billions of events, and to do so in real time, Druid creator Eric Tschetter wrote in a 2011 blog post introducing Druid. To accomplish this, Tschetter decided that Druid would feature a parallelized, in-memory architecture that scaled out, enabling users to easily add more memory as needed.
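To make the terms concrete, here is a minimal Python sketch of a roll-up (aggregating a metric over chosen dimensions) and a drill-down (narrowing to a slice before aggregating again). The event fields, sample values, and function names are hypothetical illustrations, not Druid's API or query language.

```python
from collections import defaultdict

# Hypothetical clickstream events; field names are illustrative only.
EVENTS = [
    {"country": "US", "device": "mobile", "clicks": 3},
    {"country": "US", "device": "desktop", "clicks": 5},
    {"country": "DE", "device": "mobile", "clicks": 2},
    {"country": "US", "device": "mobile", "clicks": 4},
]


def roll_up(events, dimensions, metric):
    """Sum `metric` for each distinct combination of `dimensions`."""
    totals = defaultdict(int)
    for event in events:
        key = tuple(event[d] for d in dimensions)
        totals[key] += event[metric]
    return dict(totals)


def drill_down(events, filters):
    """Keep only the events whose fields match every filter value."""
    return [e for e in events if all(e[k] == v for k, v in filters.items())]


# Roll up clicks by country, then drill into the US slice by device.
by_country = roll_up(EVENTS, ["country"], "clicks")
us_by_device = roll_up(drill_down(EVENTS, {"country": "US"}), ["device"], "clicks")
print(by_country)
print(us_by_device)
```

This version scans every event for each query, which is exactly what becomes too slow at billions of rows.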
Druid essentially maps the data to memory as it arrives, compresses it into a column, and then builds indexes for each column. It also maintains two separate subsystems: a read-optimized subsystem in the historical nodes, and a write-optimized subsystem in real-time nodes (hence the name “Druid,” a shape-shifting character common in role-playing games). This approach lets the database query very large amounts of historical and real-time data, says Tschetter, who left Metamarkets to join Yahoo in late 2014.
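The column-plus-index idea can be sketched in a few lines of Python. This is a toy under stated assumptions, not Druid's implementation: the class name `ColumnarSegment`, the dictionary encoding used as a stand-in for compression, and the Python sets used as per-value indexes are all illustrative choices.

```python
class ColumnarSegment:
    """Toy column store: one encoded column and one index per dimension."""

    def __init__(self, rows, dimensions, metric):
        # The metric is stored as its own column of numbers.
        self.metric_values = [row[metric] for row in rows]
        self.dictionaries = {}  # dimension -> {value: integer code}
        self.encoded = {}       # dimension -> list of integer codes
        self.indexes = {}       # dimension -> {value: set of row ids}
        for dim in dimensions:
            dictionary, codes, index = {}, [], {}
            for row_id, row in enumerate(rows):
                value = row[dim]
                # Dictionary encoding: repeated strings become small integers.
                codes.append(dictionary.setdefault(value, len(dictionary)))
                # Index: remember which rows contain each value.
                index.setdefault(value, set()).add(row_id)
            self.dictionaries[dim] = dictionary
            self.encoded[dim] = codes
            self.indexes[dim] = index

    def sum_where(self, filters):
        """Sum the metric over rows matching every dimension=value filter."""
        matching = set(range(len(self.metric_values)))
        for dim, value in filters.items():
            # Intersect row-id sets instead of scanning every row.
            matching &= self.indexes[dim].get(value, set())
        return sum(self.metric_values[i] for i in matching)


# Hypothetical sample rows; field names are illustrative only.
ROWS = [
    {"country": "US", "device": "mobile", "clicks": 3},
    {"country": "US", "device": "desktop", "clicks": 5},
    {"country": "DE", "device": "mobile", "clicks": 2},
    {"country": "US", "device": "mobile", "clicks": 4},
]

segment = ColumnarSegment(ROWS, ["country", "device"], "clicks")
print(segment.sum_where({"country": "US", "device": "mobile"}))
```

Because each filter resolves to a set of row ids, a query touches only the rows it needs, which is the property the Yahoo engineers later describe as "finding exactly what it needs to scan."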
“Druid’s power resides in providing users fast, arbitrarily deep exploration of large-scale transaction data,” Tschetter writes. “Queries over billions of rows, that previously took minutes or hours to run, can now be investigated directly with sub-second response times.”
Metamarkets released Druid as an open source project on GitHub in October 2012. Since then, a number of companies have used the software for purposes including video network monitoring, operations monitoring, and online advertising analytics, according to a 2014 white paper.
Netflix was one of the early companies testing Druid, but it’s unclear whether it ever put the software into production. One company that has adopted Druid is Yahoo, the ancestral home of Hadoop. Yahoo is now using Druid to power a variety of real-time analytic interfaces, including executive-level dashboards and customer-facing analytics, according to a post last week on the Yahoo Engineering blog.
Yahoo engineers explain Druid in this manner:
“The architecture blends traditional search infrastructure with database technologies and has parallels to other closed-source systems like Google’s Dremel, Powerdrill and Mesa. Druid excels at finding exactly what it needs to scan for a query, and was built for fast aggregations over arbitrary slice-and-diced data. Combined with its high availability characteristics, and support for multi-tenant query workloads, Druid is ideal for powering interactive, user-facing, analytic applications.”
Yahoo landed on Druid after attempting to build its data applications using various infrastructure pieces, including Hadoop and Hive, relational databases, key/value stores, Spark and Shark, Impala, and many others. “The solutions each have their strengths,” Yahoo wrote, “but none of them seemed to support the full set of requirements that we had,” which included ad hoc slice and dice, scaling to tens of billions of events a day, and ingestion of data in real time.
Another property of Druid that caught Yahoo’s eye was its “lock-free, streaming ingestion capabilities.” Because Druid works with open source big data message buses such as Kafka, as well as with proprietary systems, it fits nicely into Yahoo’s stack, the company said. “Events can be explored milliseconds after they occur while providing a single consolidated view of both real-time events and historical events that occurred years in the past,” the company writes.
As it does for all open source products that it finds useful, Yahoo is investing in Druid. For more info, see the Druid website at http://druid.io.
Related Items:
The Real-Time Future of Data According to Jay Kreps
Glimpsing Hadoop’s Real-Time Analytic Future
Druid Summons Strength in Real-Time