Yahoo Casts Real-Time OLAP Queries with Druid
Yahoo is in the process of implementing a big data tool called Druid to power high-speed real-time queries against its massive Hadoop-based data lake. Engineers at the Web giant say the open source database’s combination of speed and usability on fast-moving data makes it ideal for the job.
Druid is a column-oriented in-memory OLAP data store that was originally developed more than four years ago by the folks at Metamarkets, a developer of programmatic advertising solutions. The company was struggling to keep the Web-based analytic consoles it provides customers fed with the latest clickstream data using relational tools like Greenplum and NoSQL databases like HBase, so it developed its own distributed database instead.
The core design parameter for Druid was the ability to compute drill-downs and roll-ups over a large set of “high dimensional” data comprising billions of events, and to do so in real time, Druid creator Eric Tschetter wrote in a 2011 blog post introducing Druid. To accomplish this, Tschetter decided that Druid would feature a parallelized, in-memory architecture that scaled out, enabling users to easily add more memory as needed.
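To make the terms concrete, here is a minimal Python sketch of a roll-up and a drill-down over a handful of events. It is illustrative only and is not Druid's API: the event fields (`country`, `device`, `clicks`) and the helper names `roll_up` and `drill_down` are assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical clickstream events; field names are illustrative only.
EVENTS = [
    {"country": "US", "device": "mobile", "clicks": 3},
    {"country": "US", "device": "desktop", "clicks": 5},
    {"country": "DE", "device": "mobile", "clicks": 2},
    {"country": "US", "device": "mobile", "clicks": 4},
]


def roll_up(events, dims, metric):
    """Sum `metric` for every distinct combination of the `dims` values."""
    totals = defaultdict(int)
    for event in events:
        key = tuple(event[d] for d in dims)
        totals[key] += event[metric]
    return dict(totals)


def drill_down(events, filters, dims, metric):
    """Keep only events matching `filters`, then roll up by finer `dims`."""
    selected = [
        e for e in events
        if all(e[name] == value for name, value in filters.items())
    ]
    return roll_up(selected, dims, metric)


# Roll up clicks by country, then drill into the US by device.
by_country = roll_up(EVENTS, ["country"], "clicks")
us_by_device = drill_down(EVENTS, {"country": "US"}, ["device"], "clicks")
```

A system operating on billions of events would spread this work across many machines and keep the data in memory, but the shape of the query is the same: group by some dimensions, aggregate a metric, then narrow the filter and regroup.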
Druid essentially maps the data to memory as it arrives, compresses it into a column, and then builds indexes for each column. It also maintains two separate subsystems: a read-optimized subsystem in the historical nodes, and a write-optimized subsystem in real-time nodes (hence the name “Druid,” a shape-shifting character common in role-playing games). This approach lets the database query very large amounts of historical and real-time data, says Tschetter, who left Metamarkets to join Yahoo in late 2014.
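The column-plus-index idea can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not Druid's actual storage format: the `StringColumn` class and `sum_where` helper are hypothetical, dictionary encoding stands in for column compression, and plain Python sets stand in for whatever compact index structure a production system would use.

```python
class StringColumn:
    """One dimension stored as a column: dictionary-encoded values plus
    an inverted index from each value to the rows that contain it."""

    def __init__(self):
        self.dictionary = {}  # value -> small integer code
        self.codes = []       # one code per row (the compressed column)
        self.index = {}       # value -> set of row numbers

    def append(self, value):
        row = len(self.codes)
        code = self.dictionary.setdefault(value, len(self.dictionary))
        self.codes.append(code)
        self.index.setdefault(value, set()).add(row)

    def rows_where(self, value):
        """Return the rows holding `value` without scanning the column."""
        return self.index.get(value, set())


def sum_where(filters, metric):
    """Sum `metric` over rows matching every (column, value) filter."""
    rows = None
    for column, value in filters:
        matches = column.rows_where(value)
        rows = matches if rows is None else rows & matches
    if rows is None:  # no filters: aggregate every row
        rows = range(len(metric))
    return sum(metric[r] for r in rows)


# Load a few hypothetical events column by column as they "arrive".
country, device, clicks = StringColumn(), StringColumn(), []
for c, d, n in [("US", "mobile", 3), ("US", "desktop", 5),
                ("DE", "mobile", 2), ("US", "mobile", 4)]:
    country.append(c)
    device.append(d)
    clicks.append(n)

us_mobile_clicks = sum_where([(country, "US"), (device, "mobile")], clicks)
```

Intersecting the per-value row sets means the query touches only the rows it needs in the metric column, which is the property that makes filtered aggregations fast in a column-oriented design.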
“Druid’s power resides in providing users fast, arbitrarily deep exploration of large-scale transaction data,” Tschetter writes. “Queries over billions of rows, that previously took minutes or hours to run, can now be investigated directly with sub-second response times.”
Metamarkets released Druid as an open source project on GitHub in October 2012. Since then, a number of companies have used the software for various purposes, including video network monitoring, operations monitoring, and online advertising analytics, according to a 2014 white paper.
Netflix was one of the early companies testing Druid, but it’s unclear whether it put the software into production. One company that has adopted Druid is Yahoo, the ancestral home of Hadoop. Yahoo is now using Druid to power a variety of real-time analytic interfaces, including executive-level dashboards and customer-facing analytics, according to a post last week on the Yahoo Engineering blog.
Yahoo engineers explain Druid in this manner:
“The architecture blends traditional search infrastructure with database technologies and has parallels to other closed-source systems like Google’s Dremel, Powerdrill and Mesa. Druid excels at finding exactly what it needs to scan for a query, and was built for fast aggregations over arbitrary slice-and-diced data. Combined with its high availability characteristics, and support for multi-tenant query workloads, Druid is ideal for powering interactive, user-facing, analytic applications.”
Yahoo landed on Druid after attempting to build its data applications using various infrastructure pieces, including Hadoop and Hive, relational databases, key/value stores, Spark and Shark, Impala, and many others. “The solutions each have their strengths,” Yahoo wrote, “but none of them seemed to support the full set of requirements that we had,” which included ad hoc slice and dice, scaling to tens of billions of events a day, and ingestion of data in real time.
Another property of Druid that caught Yahoo’s eye was its “lock-free, streaming ingestion capabilities.” Druid’s ability to work with open source big data message buses, like Kafka, as well as with proprietary systems, means it fits nicely into Yahoo’s stack, the company said. “Events can be explored milliseconds after they occur while providing a single consolidated view of both real-time events and historical events that occurred years in the past,” the company writes.
As it does for all open source products that it finds useful, Yahoo is investing in Druid. For more info, see the Druid website at http://druid.io.