Druid Summons Strength in Real-Time
This has indeed been the year of Hadoop as the go-to synonym for big data, but according to some who hover on the fringes of that ecosystem (and even some who sit in the middle of it), much of the technology behind it has been fetishized to the point of being neither useful nor comprehensible to actual business users.
From the Hortonworks exec who told us that many businesses are confused about where Hadoop belongs in their enterprise strategy, to Mike Driscoll, CEO of Metamarkets, who told us this week that there is no compelling message or context for enterprise technology folks, the next big thing for big data might simply be a sustained push toward ease of use (rather than ever more complicated functionality) for technology executives.
At the heart of this goal of usability with added capability and performance sits the almighty cloud. By wicking away the hardware headaches and management hassles, companies like Driscoll’s can add features that boost speed without requiring a bevy of new hires.
Metamarkets focuses distinctly on what it calls “web-scale” companies, which for its business tends to mean a number of large-scale digital publishers, including the Financial Times, as well as several higher-end online advertising platform vendors.
In an effort to expand its real-time capabilities, the company architected Druid, a streaming data store component of its cloud-delivered analytics platform, which was recently set free to spread its wings in the open source community. The data store, named after a shape-shifting comic character, could address some of the challenges of traditional database approaches, or at the very least give the open source community something new to chew on.
On that note, it’s important to define what’s really meant by real-time here. For an area like high-frequency trading, for instance, no amount of fiddling with Druid could make it battle-ready. But since Driscoll defines real-time as under a thousand milliseconds (even via cloud delivery), he claims that for most use cases in the company’s bread-and-butter market, ad platforms, this is certainly fast enough.
The company says that for the primary markets it serves, namely large-scale web publishing and digital advertising, the stack it runs on (large, non-tricked-out Amazon EC2 instance types) already leverages a number of open source projects for processing, querying and visualizing high-volume streaming data. However, as Driscoll described for us, Druid offered something they weren’t able to find elsewhere: the ability to stream data via an in-memory approach, using a column-based data store for lower-latency query response times.
Part of Metamarkets’ pitch with Druid is that Hadoop is not the cure-all for big data woes. Driscoll says that while it’s an excellent approach for massive data, the time it takes to chew through queries is too long. When asked how this might be solved by other approaches, including Impala, the new “real-time” Hadoop query engine developed by Cloudera, he said Impala is great for making queries fast but misses the mark when it comes to interactivity with data.
As Driscoll put it, a real-time Hadoop tool like Impala can address the speed concerns around issuing a query to a Hadoop system, but that’s only one aspect of what matters to companies that will actually make use of a faster, more efficient Hadoop. “Speed is just one part of it; you can launch all kinds of MapReduce processes, and while the response might be fast, the matter of latency between when an event happens and when you know about it is the second critical component there.”
Part of what makes Druid noteworthy is that as events come into the open source data store, they are immediately accessible for querying. As Driscoll describes, “anything that has a Hadoop-backed architecture can always have something built on top like a caching layer, but unless it’s integrated into the overall architecture, it’s hard to get any real-time visibility into your data.”
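To make that concrete, the sketch below builds the kind of JSON query a client might send to a Druid cluster to aggregate events from the last few minutes. The datasource name, metric names and broker URL are hypothetical, and the exact query shape is an assumption based on Druid’s native JSON-over-HTTP query format, not code from Metamarkets.

```python
import json

def build_recent_events_query(datasource, minutes=5):
    """Build a hypothetical Druid 'timeseries' query over a short, recent window.

    The datasource, interval and metric names here are illustrative only;
    the overall JSON shape follows Druid's native query format.
    """
    return {
        "queryType": "timeseries",
        "dataSource": datasource,
        "granularity": "minute",
        # In a real client this interval would be computed from the current time.
        "intervals": ["2012-11-01T00:00/2012-11-01T00:%02d" % minutes],
        "aggregations": [
            {"type": "count", "name": "events"},
            {"type": "doubleSum", "fieldName": "clicks", "name": "clicks"},
        ],
    }

query = build_recent_events_query("ad_events")
print(json.dumps(query, indent=2))
# The query would then be POSTed to a broker node, for example:
#   requests.post("http://broker.example:8082/druid/v2", json=query)
```

Because incoming events are indexed on arrival, a query like this can cover data that landed seconds ago, which is the “integrated into the overall architecture” point Driscoll is making.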
On the surface, Druid looks like a standard MPP database, such as Vertica or Netezza, but the key difference is that it was architected from scratch to fit a cloud context. In other respects it behaves somewhat like Hadoop, with data sharded across many nodes, partitioning and parallel performance boosts, and a similar fault-tolerance mechanism that offers double replication of data (where Hadoop offers triple). The secondary value proposition is the in-memory component, which Driscoll says offers a 1,000x performance increase over traditional database approaches like Vertica, Netezza, Greenplum and even the much-heralded Dremel (not far removed from Impala).
When it comes to Dremel, for example, Driscoll says the differentiation is that Dremel and the others are still disk-bound, and disk read speeds are about 1,000x lower than reads from DRAM. “In-memory databases have a massive performance advantage over traditional disk-backed databases, so they are a key component of Druid,” he said.
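The column-store side of that claim can be illustrated with a toy example (this is not Druid’s actual code): when data is laid out one contiguous array per column, an aggregation over a single metric only scans the values it needs, instead of touching every full row.

```python
# Toy illustration of row-oriented vs. column-oriented layout for aggregation.
# Data values and field names are invented for the example.

# Row-oriented: summing one metric still means walking every whole row.
rows = [
    {"publisher": "ft.com",  "impressions": 10, "clicks": 1},
    {"publisher": "ft.com",  "impressions": 20, "clicks": 2},
    {"publisher": "nyt.com", "impressions": 30, "clicks": 3},
]
row_total = sum(r["impressions"] for r in rows)

# Column-oriented: the same data stored as one array per column, so
# SUM(impressions) scans only the impressions values.
columns = {
    "publisher":   ["ft.com", "ft.com", "nyt.com"],
    "impressions": [10, 20, 30],
    "clicks":      [1, 2, 3],
}
col_total = sum(columns["impressions"])

assert row_total == col_total == 60
```

The results are identical; the difference is how much data has to be read to get them, which is where the columnar, in-memory approach earns its latency advantage.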
Beyond the capabilities Driscoll mentioned, Druid’s support for rolling restarts may be a worthwhile capability for users who don’t want to restart the whole database to deploy new code: nodes can be brought down one at a time and then back up again in rolling fashion.
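A minimal sketch of that idea, with invented node names and a stand-in for the actual deploy and health-check steps (this models the pattern, not Druid’s real operational tooling):

```python
# Rolling restart sketch: cycle nodes one at a time so the cluster as a
# whole stays available while new code is deployed.

nodes = ["druid-node-1", "druid-node-2", "druid-node-3"]
up = set(nodes)        # nodes currently serving queries
min_up = len(nodes)    # lowest availability observed during the deploy

def restart(node):
    """Take one node down, 'deploy' new code, and bring it back."""
    global min_up
    up.discard(node)                 # drain and stop the node
    min_up = min(min_up, len(up))    # ...deploy + health check would go here...
    up.add(node)                     # node rejoins the cluster

for node in nodes:
    restart(node)

# Availability never dropped below n-1 nodes, so queries (served from
# data replicated on the remaining nodes) were never interrupted.
assert min_up == len(nodes) - 1
```

The replication described above is what makes this safe: while one node is down, its data is still served by a replica elsewhere in the cluster.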
“We started to see a clear need in the Hadoop ecosystem for something that could be real-time and fast at scale,” said Driscoll. “Besides, we believed strongly in having an open source component since the era of licensed software is coming to a close—the future belongs to cloud-backed SaaS.” The Metamarkets CEO went on to tell us that the real-world customers they deal with don’t want to think about Hadoop, databases or the underlying stack; all they want is results. Thus, there is value in being able to deliver the entire stack with reliability built in and with the complexity of Hadoop and on-site hardware removed. He claims that since they open-sourced Druid in mid-October, there have been hundreds of GitHub downloads and more than 20 forks of the database.
All the open source interest in the world is useless without an actual use case, but when it comes to users of Druid, the company was quick to point to Netflix as proof of scale. Driscoll told us that the video giant got wind of the open source data store and was granted an early look at the architecture and tested the offering.
The point is that it is able to scale. Metamarkets has processed over a trillion events on its platform in total and handles between 10 and 20 billion events per day. Driscoll said that in a market like online advertising, there are around 100 billion micro-transactions per day across the vendors who cater to this high-volume, high-speed and quickly growing segment, so it’s not surprising that some of the core innovation in big data technology is coming out of the online ad markets. Metamarkets’ message of usability and performance, however, can be easy to lose, since its core customer base is so targeted and it doesn’t tend to pitch heavily to other verticals, even where it sees applicability in emerging areas.
We talked briefly about how the big data needs of online advertising are driving innovation that can be carried over to other industries like insurance, healthcare and smart grids. His thoughts on that crossover were noteworthy, if somewhat epic…
“What we’re witnessing in the world of digital advertising is the birth of a global digital nervous system. The same kind of wiring that helps it work can easily extend to other verticals…the tech emerging here will lead the next generation of tech for other industries.”