Yahoo Casts Real-Time OLAP Queries with Druid
Yahoo is in the process of implementing a big data tool called Druid to power high-speed real-time queries against its massive Hadoop-based data lake. Engineers at the Web giant say the open source database’s combination of speed and usability on fast-moving data makes it ideal for the job.
Druid is a column-oriented in-memory OLAP data store that was originally developed more than four years ago by the folks at Metamarkets, a developer of programmatic advertising solutions. The company was struggling to keep the Web-based analytic consoles it provides customers fed with the latest clickstream data using relational tools like Greenplum and NoSQL databases like HBase, so it developed its own distributed database instead.
The core design parameter for Druid was being able to compute drill-downs and roll-ups over a large set of “high dimensional” data comprising billions of events, and to do so in real time, Druid creator Eric Tschetter wrote in a 2011 blog post introducing Druid. To accomplish this, Tschetter decided that Druid would feature a parallelized, in-memory architecture that scaled out, enabling users to easily add more memory as needed.
Druid essentially maps the data to memory as it arrives, compresses it into columns, and then builds indexes for each column. It also maintains two separate subsystems: a read-optimized subsystem in the historical nodes, and a write-optimized subsystem in the real-time nodes (hence the name “Druid,” a shape-shifting character common in role-playing games). This approach lets the database query very large amounts of historical and real-time data, says Tschetter, who left Metamarkets to join Yahoo in late 2014.
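To make the column-and-index idea concrete, here is a minimal Python sketch of a column-oriented table that dictionary-encodes each dimension and keeps a per-value index of row ids. It is an illustration only, not Druid's actual implementation or API: the class name `ColumnStore`, its methods, and the sample events are all hypothetical, and plain Python sets stand in for whatever compressed index structures a real system would use.

```python
from collections import defaultdict


class ColumnStore:
    """Toy column store (hypothetical, for illustration; not Druid code)."""

    def __init__(self, dimensions, metric):
        self.dimensions = dimensions
        self.metric = metric
        # Per dimension: value -> small integer code (dictionary encoding).
        self.dictionaries = {d: {} for d in dimensions}
        # Per dimension: the encoded column, one code per row.
        self.columns = {d: [] for d in dimensions}
        # Per dimension: value -> set of row ids holding that value.
        self.indexes = {d: defaultdict(set) for d in dimensions}
        self.metric_values = []

    def append(self, event):
        """Store one event, updating every column and its index."""
        row_id = len(self.metric_values)
        for d in self.dimensions:
            value = event[d]
            codes = self.dictionaries[d]
            code = codes.setdefault(value, len(codes))
            self.columns[d].append(code)
            self.indexes[d][value].add(row_id)
        self.metric_values.append(event[self.metric])

    def sum_where(self, **filters):
        """Sum the metric over rows matching all dimension=value filters."""
        rows = None
        for d, value in filters.items():
            matching = self.indexes[d].get(value, set())
            rows = matching if rows is None else rows & matching
        if rows is None:  # no filters: scan every row
            rows = range(len(self.metric_values))
        return sum(self.metric_values[r] for r in rows)


store = ColumnStore(["country", "device"], "clicks")
store.append({"country": "US", "device": "mobile", "clicks": 3})
store.append({"country": "US", "device": "desktop", "clicks": 5})
store.append({"country": "DE", "device": "mobile", "clicks": 2})
print(store.sum_where(country="US", device="mobile"))  # prints 3
```

Because each filter is answered by intersecting index entries, the query touches only the rows it needs rather than scanning the whole table, which is the general property the paragraph above attributes to Druid's per-column indexes.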
“Druid’s power resides in providing users fast, arbitrarily deep exploration of large-scale transaction data,” Tschetter writes. “Queries over billions of rows, that previously took minutes or hours to run, can now be investigated directly with sub-second response times.”
Metamarkets released Druid as an open source project on GitHub in October 2012. Since then, the software has been used by a number of companies for various purposes, including video network monitoring, operations monitoring, and online advertising analytics, according to a 2014 white paper.
Netflix was one of the early companies testing Druid, but it’s unclear whether it put the software into production. One company that has adopted Druid is Yahoo, the ancestral home of Hadoop. Yahoo is now using Druid to power a variety of real-time analytic interfaces, including executive-level dashboards and customer-facing analytics, according to a post last week on the Yahoo Engineering blog.
Yahoo engineers explain Druid in this manner:
“The architecture blends traditional search infrastructure with database technologies and has parallels to other closed-source systems like Google’s Dremel, Powerdrill and Mesa. Druid excels at finding exactly what it needs to scan for a query, and was built for fast aggregations over arbitrary slice-and-diced data. Combined with its high availability characteristics, and support for multi-tenant query workloads, Druid is ideal for powering interactive, user-facing, analytic applications.”
Yahoo landed on Druid after attempting to build its data applications using various infrastructure pieces, including Hadoop and Hive, relational databases, key/value stores, Spark and Shark, Impala, and many others. “The solutions each have their strengths,” Yahoo wrote, “but none of them seemed to support the full set of requirements that we had,” which included ad hoc slice and dice, scaling to tens of billions of events a day, and ingestion of data in real time.
Another property of Druid that caught Yahoo’s eye was its “lock-free, streaming ingestion capabilities.” Its ability to work with open source big data message buses like Kafka, as well as with proprietary systems, means it fits nicely into the company’s stack, Yahoo said. “Events can be explored milliseconds after they occur while providing a single consolidated view of both real-time events and historical events that occurred years in the past,” the company writes.
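The “single consolidated view” can be pictured as a query layer that asks both kinds of nodes for partial results and merges them. The Python sketch below shows that idea under simplifying assumptions: events are `(timestamp, page)` tuples held in plain lists, and the function names `query_node` and `broker_query` are hypothetical, not part of Druid's API.

```python
from collections import Counter


def query_node(events, start, end):
    """Partial aggregate from one node: event count per page in [start, end)."""
    counts = Counter()
    for timestamp, page in events:
        if start <= timestamp < end:
            counts[page] += 1
    return counts


def broker_query(historical, realtime, start, end):
    """Merge partial counts from historical and real-time data into one view."""
    total = Counter()
    for node_events in (historical, realtime):
        total.update(query_node(node_events, start, end))
    return dict(total)


# Older events sit in the read-optimized store, fresh ones in the write-optimized store.
historical = [(100, "home"), (200, "news")]
realtime = [(900, "home"), (950, "mail")]
print(broker_query(historical, realtime, 0, 1000))
```

The caller issues one query over a time range and never needs to know which subsystem holds a given event; counts are additive, so partial results from each side can simply be summed.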
As it does for all open source products that it finds useful, Yahoo is investing in Druid. For more info, see the Druid website at http://druid.io.