Cloudera Runs Real-Time with Impala

October 25, 2012

It’s been an exciting week for the Hadoop community with new developments and tweaks galore, both on the open source and proprietary fronts.

Today the most established of the three major Hadoop distro vendors, Cloudera, upped the ante by throwing a scalable, distributed real-time query engine into the elephant’s sphere.

The effort, called Cloudera Impala, takes the platform past its batch roots via an Apache-licensed engine that turns around queries on HDFS and HBase data in ways that few thought Hadoop was capable of.

Impala was the brainchild of Marcel Kornacker, who previously co-architected the query engine for the F1 project at Google. Cloudera stated this week that Impala is architected around a flexible data model, allowing it to work over more complex data than a traditional data warehouse while remaining efficient, with interactive queries expressed in industry-standard SQL.
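
To give a concrete sense of what those interactive, standard-SQL queries look like in practice, here is a minimal sketch using the impyla Python client against an Impala daemon; the hostname, port and web_sessions table are hypothetical assumptions for illustration, not details from Cloudera's announcement.

```python
# Minimal sketch: an interactive, standard-SQL query against an Impala daemon.
# Assumes the impyla client library; the host, port and web_sessions table
# are hypothetical placeholders, not details from Cloudera's announcement.
from impala.dbapi import connect

conn = connect(host="impalad.example.com", port=21050)  # 21050: default impalad HiveServer2 port
cur = conn.cursor()

# An aggregation over an HDFS-backed table, expressed in plain SQL.
cur.execute("""
    SELECT referrer_domain, COUNT(*) AS visits
    FROM web_sessions
    WHERE session_date = '2012-10-25'
    GROUP BY referrer_domain
    ORDER BY visits DESC
    LIMIT 10
""")

for referrer_domain, visits in cur.fetchall():
    print(referrer_domain, visits)

cur.close()
conn.close()
```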

“We believe that the addition of Impala opens up a bunch of new valuable workloads that were previously off limits. It allows our customers to offload their big data warehouse so they can optimize how they invest in data,” said Mike Olson, CEO of Cloudera, during our chat with him this week at Strata Hadoop World in New York City.

During our conversation, Olson detailed Impala’s core value for users and the open source ecosystem in general, pointing to three key factors: openness, performance, and the capability to serve a potentially larger set of coming Hadoop use cases. While Hadoop has always been a strong player in the web and media communities it was designed to serve, he said financial services and, increasingly, healthcare are the hot areas for the platform. And while Impala isn’t enough of a boost to thrust Hadoop into the world of high-frequency trading, he thinks it is enough to bolster high-end risk analysis and management analytics, not to mention a range of new opportunities that will drive everything from next-gen advertising to personalized medicine.

“For the last four years you could run SQL queries on Hadoop via Hive. It would take your query, bust it into MapReduce jobs, and farm it out to the cluster. You’d hit send and then wait…and wait.” He says that since much of this work had been impossible before, users were willing to wait, whether that meant fifteen minutes or two hours. However, to open Hadoop to a whole new, valuable branch of higher-performance workloads, Cloudera needed to build a new engine, or execution layer, in Impala, while leaving existing functionality and expected features in place. “We haven’t removed MapReduce,” said Olson. “We still let you run on the same data (no need to move HDFS or HBase data out), run queries on the same data, including the results, and we leave a way in for users to tap the desired tools for the workload.”
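
To make Olson’s point concrete, the sketch below submits the identical SQL statement two ways: through Hive, which compiles it into MapReduce jobs, and through Impala’s own execution engine, with both reading the same HDFS-resident table. This is a rough illustration assuming the pyhive and impyla client libraries and the hypothetical web_sessions table from the earlier sketch, not code from Cloudera.

```python
# Rough sketch of the "same data, two engines" idea. Assumes the pyhive and
# impyla client libraries; hostnames and the web_sessions table are
# hypothetical placeholders for illustration.
from pyhive import hive
from impala.dbapi import connect as impala_connect

QUERY = "SELECT COUNT(*) FROM web_sessions WHERE session_date = '2012-10-25'"

# Path 1: Hive compiles the statement into MapReduce jobs and farms them out
# to the cluster, with batch-style latency.
hive_cur = hive.connect(host="hiveserver.example.com", port=10000).cursor()
hive_cur.execute(QUERY)                 # may churn for minutes in MapReduce
print("Hive result:", hive_cur.fetchone())

# Path 2: Impala runs the identical statement through its own distributed
# execution engine, reading the same underlying HDFS files, with no MapReduce.
impala_cur = impala_connect(host="impalad.example.com", port=21050).cursor()
impala_cur.execute(QUERY)               # typically returns interactively
print("Impala result:", impala_cur.fetchone())
```

The point of the sketch is that nothing about the data or the SQL changes between the two paths; only the execution layer underneath does.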

Impala was in development for over two years before it was released as a private beta to a few select large-scale customers, including Expedia. If you’ll recall, we had a sit-down with the online travel booking giant at the SAS event two weeks ago, where we talked about how they’re managing vast, diverse volumes of data on their mid-sized Hadoop cluster. Of course, all of this was before they were able to talk about the Impala angle, but for a company that sends nearly all of its data directly through Hadoop, this has to be a noteworthy addition to their big data arsenal.

According to Expedia’s Director of Global BI and Data Warehousing Platforms, Jeff Prather, the addition of Impala meant his team was “able to work on one single platform for big data rather than many disparate systems for archiving, ETL and analytics.” Prather said that this “evolution of Hadoop has enabled us to reduce our latency by 50% and produce a new real business insight service not previously viable.”

While the company could have easily created this as a proprietary addition to their enterprise-grade services, Impala is open source, just like the rest of their distro; the one proprietary piece remains their management suite. Olson stressed the importance of open source to their business model, noting that if they went the proprietary route they would face an uphill climb with CIOs who want a pure open system they can move in and out of (and tweak) at will. Besides, as Olson noted, “If you put a petabyte into any system at all, you’re locked into that system. The data has so much inertia that it’s painful to move.”

The CEO said his company puts about half of its engineering dollars behind boosting open source Hadoop, which he said is good for the platform, its users, and even the company’s competitors. “Open source development does another good thing for us,” he said. “It lets us share the cost of the work; I had 50 engineers writing software that we gave away last year but in the community the total population working on Hadoop was well over 500—that’s a 10x addition to your ranks.”

Even though Cloudera preaches the open source, zero lock-in philosophy, this doesn’t mean that they’re taking an anti-proprietary approach to their viability as a long-standing company. “We are building towards becoming a software company, not a services and support company,” Olson said, noting that hitting the IPO tipping point requires something other than the Red Hat model. The services and support angle provided a useful springboard four years ago, helping the company generate awareness, development efforts and new users in the beginning, but it accounts for just under one-third of the company’s revenue now.

The future of the company’s business (and its attractiveness to investors) lies in a subscription-based model, which leverages the Cloudera Manager offering to manage and monitor deployments, in addition to the guidance they offer enterprise customers. The open source implementations the company offers for free are perfectly well-suited to a certain class of users, said Olson, but “when it comes down to truly enterprise-level stuff like setting and hitting SLAs, provisioning new users, setting security policies, handling capacity planning and so forth, then it’s fair game for us to put that in the management suite and fair game to get paid for it.”

When it comes to pitting the other distro vendors against what Cloudera has with Impala and its open source approach, Olson said his company has a leg up on the competition. Not only is Cloudera the only one with a real-time query engine, but companies like MapR, which he said “took a chainsaw to Hadoop, cut away the open source storage layer and swapped in their own proprietary file system,” are going into this with a flawed business model. While Cloudera is looking to become a software company rather than a services and support enterprise, he doesn’t think MapR’s approach will be sustainable for long-term investment. The lock-in issue is a big deal to CIOs, he told us. Despite his genuine praise for MapR’s leadership and engineering team, he argued that customers will want assurances that their data isn’t going to be tied up in a bulky proprietary file system while the other distro vendors produce more stable, reliable and higher-performing platforms.

We chatted briefly about the other elephant in the room, Hortonworks, for which he expressed equal respect on the engineering and leadership fronts, though he knocked the company for being too slow to innovate. Olson claims that their current distribution is roughly equivalent to CDH3, which doesn’t have the robust fault tolerance and stability that the others have managed to add in the meantime.

In the end, says Olson, it’s the difference between Unix and Linux. It’s all about openness and the robust community behind an open source effort, one that would cost smaller companies a fortune in talent to replicate. By giving back, he said, Cloudera receives more than its share in dividends by bolstering the platform, adding functionality, and gaining a ready and willing “workforce” to extend Hadoop far past its batch beginnings.

Related Articles

Six Super-Scale Hadoop Deployments

Cloudera CTO Reflects on Hadoop Underpinnings

Expedia Adds Notes to Big Data Symphony
