Yahoo Casts Real-Time OLAP Queries with Druid
Yahoo is in the process of implementing a big data tool called Druid to power high-speed real-time queries against its massive Hadoop-based data lake. Engineers at the Web giant say the open source database’s combination of speed and usability on fast-moving data makes it ideal for the job.
Druid is a column-oriented in-memory OLAP data store that was originally developed more than four years ago by the folks at Metamarkets, a developer of programmatic advertising solutions. The company was struggling to keep the Web-based analytic consoles it provides customers fed with the latest clickstream data using relational tools like Greenplum and NoSQL databases like HBase, so it developed its own distributed database instead.
The core design parameter for Druid was being able to compute drill-downs and roll-ups over a large set of “high dimensional” data comprising billions of events, and to do so in real time, Druid creator Eric Tschetter wrote in a 2011 blog post introducing Druid. To accomplish this, Tschetter decided that Druid would feature a parallelized, in-memory architecture that scaled out, enabling users to easily add more memory as needed.
Druid essentially maps the data to memory as it arrives, compresses it into columns, and then builds indexes for each column. It also maintains two separate subsystems: a read-optimized subsystem in the historical nodes, and a write-optimized subsystem in the real-time nodes (hence the name “Druid,” a shape-shifting character common in role-playing games). This approach lets the database query very large amounts of historical and real-time data, says Tschetter, who left Metamarkets to join Yahoo in late 2014.
“Druid’s power resides in providing users fast, arbitrarily deep exploration of large-scale transaction data,” Tschetter writes. “Queries over billions of rows, that previously took minutes or hours to run, can now be investigated directly with sub-second response times.”
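To make the column-plus-index idea described above concrete, here is a minimal Python sketch. It is not Druid code: the event fields, the helper names (`build_columns`, `build_index`, `sum_clicks`), and the use of plain Python sets in place of compressed bitmap indexes are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical clickstream events; field names are illustrative only.
events = [
    {"country": "US", "device": "mobile", "clicks": 3},
    {"country": "US", "device": "desktop", "clicks": 5},
    {"country": "DE", "device": "mobile", "clicks": 2},
    {"country": "US", "device": "mobile", "clicks": 4},
]

def build_columns(rows):
    """Pivot row-oriented events into one list per column."""
    columns = defaultdict(list)
    for row in rows:
        for name, value in row.items():
            columns[name].append(value)
    return dict(columns)

def build_index(column):
    """Map each distinct value to the set of row ids that contain it."""
    index = defaultdict(set)
    for row_id, value in enumerate(column):
        index[value].add(row_id)
    return dict(index)

columns = build_columns(events)
indexes = {dim: build_index(columns[dim]) for dim in ("country", "device")}

def sum_clicks(filters):
    """Sum clicks over the rows that match every dimension=value filter."""
    matching = set(range(len(columns["clicks"])))
    for dim, value in filters.items():
        # Intersecting index entries narrows the scan to matching rows only.
        matching &= indexes[dim].get(value, set())
    return sum(columns["clicks"][i] for i in matching)

us_total = sum_clicks({"country": "US"})                        # roll-up
us_mobile = sum_clicks({"country": "US", "device": "mobile"})   # drill-down
```

Adding a second filter is a drill-down and dropping one is a roll-up; in both cases only the rows named by the indexes are read, which is the property the article credits for fast slice-and-dice queries.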
Metamarkets released Druid as an open source project on GitHub in October 2012. Since then, a number of companies have used the software for various purposes, including video network monitoring, operations monitoring, and online advertising analytics, according to a 2014 white paper.
Netflix was one of the early companies testing Druid, but it’s unclear whether it ever put the software into production. One company that has adopted Druid is Yahoo, the ancestral home of Hadoop. Yahoo is now using Druid to power a variety of real-time analytic interfaces, including executive-level dashboards and customer-facing analytics, according to a post last week on the Yahoo Engineering blog.
Yahoo engineers explain Druid in this manner:
“The architecture blends traditional search infrastructure with database technologies and has parallels to other closed-source systems like Google’s Dremel, Powerdrill and Mesa. Druid excels at finding exactly what it needs to scan for a query, and was built for fast aggregations over arbitrary slice-and-diced data. Combined with its high availability characteristics, and support for multi-tenant query workloads, Druid is ideal for powering interactive, user-facing, analytic applications.”
Yahoo landed on Druid after attempting to build its data applications using various infrastructure pieces, including Hadoop and Hive, relational databases, key/value stores, Spark and Shark, Impala, and many others. “The solutions each have their strengths,” Yahoo wrote, “but none of them seemed to support the full set of requirements that we had,” which included ad hoc slice-and-dice queries, scaling to tens of billions of events a day, and ingestion of data in real time.
Another property of Druid that caught Yahoo’s eye was its “lock-free, streaming ingestion capabilities.” Its ability to work with open source big data message buses, like Kafka, as well as with proprietary systems, means it fits nicely into Yahoo’s stack, the company said. “Events can be explored milliseconds after they occur while providing a single consolidated view of both real-time events and historical events that occurred years in the past,” the company writes.
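The “single consolidated view” can be pictured as merging partial results from the two kinds of nodes. The Python sketch below is a toy illustration only: the hour labels, the counts, and the `consolidated_view` helper are hypothetical, and it glosses over how Druid actually hands data off from real-time to historical nodes.

```python
from collections import Counter

# Hypothetical per-hour event counts held by each kind of node.
historical_counts = {"09:00": 120, "10:00": 95}
realtime_counts = {"10:00": 5, "11:00": 17}

def consolidated_view(*partials):
    """Merge partial per-hour counts into one time-ordered result."""
    merged = Counter()
    for partial in partials:
        merged.update(partial)  # Counter.update adds counts key by key
    return dict(sorted(merged.items()))

view = consolidated_view(historical_counts, realtime_counts)
```

The caller sees one result keyed by hour and never needs to know which subsystem supplied each count.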
As it does for all open source products that it finds useful, Yahoo is investing in Druid. For more info, see the Druid website at http://druid.io.