Actian Claims ‘Permanent Performance Advantage’ with SQL-on-Hadoop Tool
The SQL-on-Hadoop sweepstakes are by no means over. What’s been dubbed the “gateway drug” for Hadoop is just starting to gain traction. But according to Actian, its SQL-on-Hadoop offering, dubbed Vortex, has jumped out to an early, and permanent, lead in the performance department.
At the recent Strata + Hadoop World show, Actian pitted Vortex against Cloudera’s Impala right in the booth, where it largely re-created the results of a 2014 TPC Decision Support (TPC-DS) benchmark test that showed Vortex completing a job up to 30 times faster than Impala.
Such comparisons can be useful, but they should also be taken with a grain of salt. Benchmarks are notoriously poor at predicting real-world conditions, and vendors have been known to fiddle with their systems to fudge the results in their favor. While the TPC-DS benchmark was specifically designed to cut down on these types of shenanigans, the fact that nobody appears to be publicly sharing their TPC-DS benchmarks in an open way raises additional questions.
That’s not stopping Actian from talking about Vortex, which it says (unsurprisingly) is winning the SQL-on-Hadoop war. The company’s big data point man, CTO Mike Hoskins, shared his views on the state of Hadoop SQL at the recent show in San Jose, California.
“Impala is good,” he says. “But you just don’t write these things from scratch and run every query…It takes 10 years to write a really good, fully functioning, enterprise-class” SQL engine.
Vortex, you will remember, is a parallelized version of the Vector database that Actian developed for Hadoop. Previously called VectorWise, the column-oriented analytic database was originally developed by Peter Boncz a decade ago as part of the X100 project at a Dutch national research institute.
When it came out, VectorWise was one of a new class of massively parallel analytic databases developed specifically to provide a new level of performance on big data problems. Hoskins puts VectorWise in the same class as two other databases: Mike Stonebraker’s Vertica (now owned by Hewlett-Packard) and Barry Zane’s ParAccel, which Actian also owns. (Actian is also an investor in Zane’s latest startup, SPARQL City.)
“These were the three shiny, new, all-columnar, all-analytic, software-only, scale-out-on-commodity-hardware databases,” Hoskins says. “They are fundamentally different than the rest of the databases, in my opinion. So we enjoy permanent performance advantages over that.”
The secret sauce that makes Vortex so darned fast, Hoskins says, is vector processing. “Slowly people are realizing that vector processing is a massive innovation that they have to have in the database,” he says.
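To make the idea concrete, here is a minimal sketch of the general difference between row-at-a-time and vector-at-a-time query execution. It is not Actian's code: the function names, the sample query (summing price times quantity for rows that meet a quantity threshold), and the batch size of 1,024 are all assumptions chosen for illustration. Pure Python will not show the speedup, since the gains in a real engine come from running tight loops over arrays in compiled code, but the structure is the same.

```python
from array import array

VECTOR_SIZE = 1024  # hypothetical batch size, chosen only for illustration


def sum_where_row_at_a_time(prices, quantities, min_qty):
    """Row-at-a-time: evaluate the predicate and the expression once per row."""
    total = 0.0
    for price, qty in zip(prices, quantities):
        if qty >= min_qty:
            total += price * qty
    return total


def select_ge(column, threshold):
    """Primitive: positions within a batch where column >= threshold."""
    return [i for i, value in enumerate(column) if value >= threshold]


def multiply_selected(left, right, selection):
    """Primitive: multiply two column batches at the selected positions."""
    return [left[i] * right[i] for i in selection]


def sum_where_vectorized(prices, quantities, min_qty, vector_size=VECTOR_SIZE):
    """Vector-at-a-time: each primitive runs over a whole batch of values."""
    total = 0.0
    for start in range(0, len(prices), vector_size):
        price_batch = prices[start:start + vector_size]
        qty_batch = quantities[start:start + vector_size]
        selection = select_ge(qty_batch, min_qty)
        total += sum(multiply_selected(price_batch, qty_batch, selection))
    return total


if __name__ == "__main__":
    prices = array("d", [1.0, 2.0, 3.0, 4.0])
    quantities = array("i", [1, 5, 2, 10])
    print(sum_where_row_at_a_time(prices, quantities, 2))
    print(sum_where_vectorized(prices, quantities, 2, vector_size=3))
```

Both functions return the same answer. The vectorized version simply reorganizes the work so that each simple operation touches many column values per call, which is the kind of layout that columnar engines are built to exploit.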
Impala is not the only competition Actian has for Vortex, of course. Hortonworks continues to work on Hive (which everybody seems to despise, Hoskins says) with its Stinger initiative, HP is shipping a version of Vertica for Hadoop, Pivotal’s HAWQ will soon be open source, MapR has a play with Drill, and even IBM’s Big SQL gets play.
Hoskins says it could take 10 years for competitors to catch up to Vortex. “We see queries [from Tableau] that have 500 lines in them,” he says. “Try putting that in a query planner and optimizer and have it understand perfectly how to distribute balanced, parallel workloads around an HDFS cluster. That’s non-trivial stuff, and we solved it already and brought it into Hadoop instead of writing it from scratch.”
Hoskins admits that Hadoop is bigger than just SQL. SQL is, after all, short for Structured Query Language, which means it’s not great at crunching vast amounts of unstructured or semi-structured data. Actian offers other tools for hammering messy data down into more structured data, or for tackling “unknown unknown” types of problems, including graph analytics and its triple-store.
But once business data is in a relatively stable form, customers still want to use SQL to solve those “known unknown” types of problems. “Unstructured data is interesting and we handle that,” Hoskins says. “But there’s still an opportunity to take certain data sets that you’re trying to interrogate over and over with the lowest latencies in the world, and pay for that cost of loading them into a fixed schema SQL database, so you can get not only incredible response time and access.”
It’s all about making business analysts who are skilled in SQL and have domain expertise productive on Hadoop. “This is about addressing that shortage in the Hadoop world, where people are slinging Pig and MapReduce code, tragically,” Hoskins says. “What if you could bring a high-level, high-productivity, high-function language like SQL to the game? It could be very important.”