Yahoo Casts Real-Time OLAP Queries with Druid
Yahoo is in the process of implementing a big data tool called Druid to power high-speed real-time queries against its massive Hadoop-based data lake. Engineers at the Web giant say the open source database’s combination of speed and usability on fast-moving data makes it ideal for the job.
Druid is a column-oriented in-memory OLAP data store that was originally developed more than four years ago by the folks at Metamarkets, a developer of programmatic advertising solutions. The company was struggling to keep the Web-based analytic consoles it provides customers fed with the latest clickstream data using relational tools like Greenplum and NoSQL databases like HBase, so it developed its own distributed database instead.
The core design parameter for Druid was being able to compute drill-downs and roll-ups over a large set of “high dimensional” data comprising billions of events, and to do so in real time, Druid creator Eric Tschetter wrote in a 2011 blog post introducing Druid. To accomplish this, Tschetter decided that Druid would feature a parallelized, in-memory architecture that scaled out, enabling users to easily add more memory as needed.
Druid essentially maps the data to memory as it arrives, compresses it into columns, and then builds indexes for each column. It also maintains two separate subsystems: a read-optimized subsystem in the historical nodes, and a write-optimized subsystem in the real-time nodes (hence the name “Druid,” a shape-shifting character common in role-playing games). This approach lets the database query very large amounts of historical and real-time data, says Tschetter, who left Metamarkets to join Yahoo in late 2014.
“Druid’s power resides in providing users fast, arbitrarily deep exploration of large-scale transaction data,” Tschetter writes. “Queries over billions of rows, that previously took minutes or hours to run, can now be investigated directly with sub-second response times.”
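The column-and-index layout described above can be illustrated with a small Python sketch. It is not Druid's code or API: the `Column` class, the `ingest` and `sum_clicks` helpers, and the sample events are all hypothetical, and plain Python sets stand in for the compressed indexes a real column store would use. The idea it shows is only that each dimension is stored as its own dictionary-encoded column with an index from value to matching rows, so a filter is answered by intersecting row sets rather than scanning every event.

```python
from collections import defaultdict


class Column:
    """Toy dictionary-encoded column with an inverted index (illustrative only)."""

    def __init__(self):
        self.dictionary = {}            # distinct value -> small integer id
        self.ids = []                   # one integer id per row (the "compressed" column)
        self.index = defaultdict(set)   # value -> set of row numbers holding that value

    def append(self, value):
        # Assign the next free id the first time a value is seen.
        value_id = self.dictionary.setdefault(value, len(self.dictionary))
        row = len(self.ids)
        self.ids.append(value_id)
        self.index[value].add(row)

    def rows_matching(self, value):
        return self.index.get(value, set())


def ingest(events, dimensions):
    """Build one column per dimension as events arrive."""
    columns = {name: Column() for name in dimensions}
    for event in events:
        for name in dimensions:
            columns[name].append(event[name])
    return columns


# Hypothetical clickstream events.
events = [
    {"country": "US", "device": "mobile", "clicks": 3},
    {"country": "DE", "device": "desktop", "clicks": 1},
    {"country": "US", "device": "desktop", "clicks": 5},
    {"country": "US", "device": "mobile", "clicks": 2},
]
columns = ingest(events, ["country", "device"])
clicks = [event["clicks"] for event in events]  # metric column


def sum_clicks(filters):
    """Sum the clicks metric over rows matching every dimension filter."""
    rows = set(range(len(clicks)))
    for dimension, value in filters.items():
        rows &= columns[dimension].rows_matching(value)
    return sum(clicks[row] for row in rows)


us_mobile_clicks = sum_clicks({"country": "US", "device": "mobile"})
```

Dropping the `device` filter from the call rolls the result up to all US traffic, and adding it back drills down again, which is the kind of slice-and-dice query the article describes.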
Metamarkets released Druid as an open source project on GitHub in October 2012. Since then, a number of companies have used the software for purposes including video network monitoring, operations monitoring, and online advertising analytics, according to a 2014 white paper.
Netflix was one of the early companies testing Druid, but it’s unclear whether it put the software into production. One company that has adopted Druid is Yahoo, the ancestral home of Hadoop. Yahoo is now using Druid to power a variety of real-time analytic interfaces, including executive-level dashboards and customer-facing analytics, according to a post last week on the Yahoo Engineering blog.
Yahoo engineers explain Druid in this manner:
“The architecture blends traditional search infrastructure with database technologies and has parallels to other closed-source systems like Google’s Dremel, Powerdrill and Mesa. Druid excels at finding exactly what it needs to scan for a query, and was built for fast aggregations over arbitrary slice-and-diced data. Combined with its high availability characteristics, and support for multi-tenant query workloads, Druid is ideal for powering interactive, user-facing, analytic applications.”
Yahoo landed on Druid after attempting to build its data applications using various infrastructure pieces, including Hadoop and Hive, relational databases, key/value stores, Spark and Shark, Impala, and many others. “The solutions each have their strengths,” Yahoo wrote, “but none of them seemed to support the full set of requirements that we had,” which included ad hoc slice and dice, scaling to tens of billions of events a day, and ingestion of data in real time.
Another property of Druid that caught Yahoo’s eye was its “lock-free, streaming ingestion capabilities.” The ability to work with open source big data message buses, like Kafka, as well as with proprietary systems, means it fits nicely into its stack, Yahoo said. “Events can be explored milliseconds after they occur while providing a single consolidated view of both real-time events and historical events that occurred years in the past,” the company writes.
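The “single consolidated view” Yahoo describes can be pictured as a query that fans out to both kinds of node and merges their partial results. The Python sketch below is a simplified illustration under that assumption, not Druid's implementation or query API: the `aggregate` and `query` functions, the integer timestamps, and the sample events are all hypothetical.

```python
from collections import Counter


def aggregate(events, start, end):
    """Count events per page with timestamps in [start, end) on one node."""
    counts = Counter()
    for timestamp, page in events:
        if start <= timestamp < end:
            counts[page] += 1
    return counts


def query(historical_events, realtime_events, start, end):
    """Ask both stores for partial counts, then merge them into one answer."""
    historical_part = aggregate(historical_events, start, end)
    realtime_part = aggregate(realtime_events, start, end)
    return historical_part + realtime_part


# Hypothetical (timestamp, page) events: older data sits in the
# read-optimized store, fresh data in the write-optimized one.
historical = [(100, "home"), (150, "news"), (200, "home")]
realtime = [(950, "home"), (990, "mail")]

all_time = query(historical, realtime, 0, 1000)
recent_only = query(historical, realtime, 900, 1000)
```

Because each store returns only partial counts for the requested interval, the caller sees one merged result and never needs to know which store held which event.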
As it does for all open source products that it finds useful, Yahoo is investing in Druid. For more info, see the Druid website at http://druid.io.