Building Presto Business No Magic Trick for Starburst
When it comes to next-gen SQL query engines, there’s no shortage of players jockeying for a starting role on big Hadoop clusters and in the cloud. While the Hives, Impalas, Drills, and Spark SQLs of the world have loyal fan bases, one competitor worth keeping an eye on is Presto, which has journeyman SQL credentials and commercial backing from an outfit called Starburst.
Presto is the distributed in-memory SQL query engine originally developed by Facebook to be a faster and more flexible alternative to Apache Hive, which it also created. Hive is tied to Hadoop and originally had batch-oriented MapReduce internals (it’s now based on Tez), but Facebook recognized that it needed a way to query data sitting in other data stores, such as the massive MySQL relational database that it still uses. The social media giant also needed faster interactive performance to power huge ad-hoc queries running across its clusters, and so it created Presto.
The SQL engine has been picked up by some pretty big names since Facebook open-sourced it. In 2014, Netflix announced that it uses Presto to query 10PB worth of data it has stored in Amazon S3 (it’s now more than 40PB). Other big tech outfits, including Uber, Lyft, and Airbnb, use the SQL engine too. Presto isn’t the only SQL engine used by these companies, of course, but its unique combination of processing speed, coverage of the ANSI SQL standard, and support for non-Hadoop data stores helps to win it playing time against better-known players in the SQL lineup.
These bona fides – not to mention Presto’s position as an independent project not backed by one of the three big Hadoop distributors – led Teradata to throw its weight behind Presto. In 2014, the data warehousing giant acquired Hadapt, the commercial outfit that backed Presto at the time. By pairing Presto with its federated QueryGrid technology, Teradata sought to give its marquee enterprise customers the tools to analyze data wherever it sits – in HDFS or in its own data warehouse – and hopefully forestall its customers’ threats to migrate whole-hog into Hadoop.
While Teradata’s backing helped raise Presto’s profile, it also had an unfortunate side-effect: neglect of mid-market firms.
Because Teradata is laser-focused on serving the needs of Fortune 500 accounts, it did not follow up on the majority of downloads of the open source Presto software. So in late 2017, the players agreed that Teradata would spin the Presto business back out into its own company, called Starburst.
The partnership with Teradata remains strong, according to Starburst CEO Justin Borgman, who says both companies are benefiting from the spin-out.
“Unfortunately for us, Teradata really only is interested in the top 500 customers in the world,” Borgman tells Datanami. “They made that a very strong focus. They really aren’t set up from a go-to-market standpoint to sell to anything less than a few-billion-in-revenue company…. We felt there was this huge opportunity outside of the top 500 that we could exploit.”
The parties worked out a deal whereby Teradata can continue to resell the Presto support offering, and Starburst will provide the technical support services to those Fortune 500 clients – as well as to any mid-market clients it can land on its own. That gave Starburst a head start relative to other startups.
“We get to start the company with a few million in revenue, which is an awesome change relative to my last startup,” Borgman says. “You don’t usually start out that way. Now we’re trying to figure out how we can grow this and further the adoption of the product.”
In addition to getting those support contracts from Teradata, Starburst has a leg up in that it already has a fairly mature product. That helps to position not only the open source Presto product, but also its Starburst Distribution for Presto (which has all the latest, greatest additions to Presto that haven’t yet made their way into the open source product), in a favorable light against other SQL query engines.
Borgman says Presto has several strengths that should be relevant in customer discussions about SQL engines.
For starters, there’s the adherence to the 2003 ANSI SQL standard. In 2016, an outfit called Radiant Advisors investigated how five SQL tools – Presto, Spark SQL, Impala, Hive on Tez, and classic Hive (running MapReduce 2) – ranked in support for the SQL standard. Presto came out on top, both in terms of its compatibility index (which ranked query compatibility and the customization required to execute; see Fig.01) and in terms of the number of queries that a given engine could run based on a 100GB benchmark.
Borgman attributes this advantage to Facebook’s original design. “The SQL parser was built from scratch when they decided to create the Presto project, as opposed to so many of the other SQL engines, which actually take the Hive parser and sort of staple that onto whatever SQL on Hadoop solution they’re building.”
Questions about SQL coverage are not as frequent as they used to be, Borgman says. “I think we’ve filled in all the pieces that people are really using at this point and now the only things we’re looking at are things like geospatial, which actually has been improving pretty quickly in Presto,” he says. Uber uses geospatial queries, and has made a number of commits to the open source project to support queries against geospatial data, he says.
Speed is another advantage cited by Presto backers. Wherever it’s running – on a Hadoop cluster (not as common), on its own cluster next to Hadoop (more common), or in the cloud (the fastest growing segment) – Presto can flat-out fly if given enough processing power and memory (it doesn’t persist data, so big storage isn’t needed).
“First and foremost, people are attracted to the speed,” Borgman says. “When you benchmark this thing and work with us, you’ll see that it’s just very fast, particularly for high-concurrency use cases, where you have many users accessing the system simultaneously. It’s been proven at scale. You can point to high profile users like Facebook, Uber, Dropbox, Twitter, Airbnb, Netflix, etc., who are using this at massive scale.”
Comcast, another Teradata customer that now has a support contract with Starburst for its Presto environment, ran a benchmark test last year – or really a good old-fashioned smackdown – to see how the various SQL engines matched up. Hortonworks’ entry in the race, Hive with its Live Long and Process (LLAP) in-memory caching technology, came in first, followed by Presto in second and Hive on Tez in third. Spark SQL was a bit green still and came in fourth, while MapReduce gained the inglorious title of “dumpster fire.”
Borgman wasn’t thrilled with the second-place finish, even though Comcast complimented the stability and reliability of the tech. “That benchmark certainly did not present Presto in the best light,” he says, pointing out that Comcast is actually growing its use of Presto. The source of that less-than-ideal showing, he says, was Presto’s difficulty with large joins.
“It’s really where there are lots of complex joins where Presto has struggled occasionally in the past and that was really because there was no cost model built into the product,” Borgman says. “We think that by virtue of having an optimizer, we’ll see a big performance gain there… We think the new query optimizer that will come out in a couple of weeks will make a huge difference.” The company plans to release TPC-DS benchmarks that prove that point.
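To see why a cost model matters for joins, consider a toy sketch in Python. It is not Presto's optimizer; the table names, row counts, and selectivities are all hypothetical. It estimates the size of each intermediate result for every left-deep join order and picks the cheapest, which is the basic idea behind cost-based join ordering.

```python
from itertools import permutations

# Hypothetical row counts; not taken from any benchmark in this article.
TABLE_ROWS = {"sales": 1_000_000, "stores": 500, "dates": 3_650}

# Hypothetical selectivities: the fraction of the cross product that
# survives each join predicate. Pairs not listed have no predicate.
JOIN_SELECTIVITY = {
    frozenset({"sales", "stores"}): 1 / 500,
    frozenset({"sales", "dates"}): 1 / 3_650,
}


def plan_cost(order):
    """Sum of estimated intermediate result sizes for a left-deep join order."""
    joined = {order[0]}
    rows = TABLE_ROWS[order[0]]
    cost = 0.0
    for table in order[1:]:
        # Combine every predicate linking the new table to tables already
        # joined; with no predicate the factor stays 1.0 (a cross join).
        selectivity = 1.0
        for other in joined:
            selectivity *= JOIN_SELECTIVITY.get(frozenset({table, other}), 1.0)
        rows = rows * TABLE_ROWS[table] * selectivity
        cost += rows
        joined.add(table)
    return cost


def best_order(tables):
    """Exhaustively pick the cheapest order (fine for a handful of tables)."""
    return min(permutations(tables), key=plan_cost)


if __name__ == "__main__":
    for order in permutations(TABLE_ROWS):
        print(order, round(plan_cost(order)))
    print("chosen:", best_order(TABLE_ROWS))
```

In this sketch, an engine with no cost model that simply joined the two small tables first would build a large cross product before touching the fact table, while the cost-based choice avoids it. Real optimizers use far richer statistics and search strategies than this exhaustive loop.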
Independence is Presto’s third big virtue, particularly in light of the massive growth that Amazon, Microsoft, and Google are seeing with their public cloud platforms and the proprietary SQL query engines they offer. Borgman says the fact that Presto is not controlled by any of the Hadoop distributors or cloud platforms gives it a leg up.
“What we would hope for is that people see Presto as this neutral engine that fills such an important role in the stack,” he says. “The SQL stuff is high value, and maybe it’s so high value that I don’t want to lock it down to a particular cloud vendor or a particular Hadoop distribution.”
That freedom of movement also extends to Presto’s capability to query data sitting in a number of different databases, file systems, and other stores, which gives the product another advantage. Borgman calls it the “query anything” approach. “It was really built with that in mind from the beginning,” he says.
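The “query anything” idea can be illustrated with a small Python sketch. It does not use Presto's connector API; the class names, catalog names, and sample data are hypothetical stand-ins. Two very different stores expose the same `scan` method, and a tiny engine joins their rows in memory without copying either data set into the other system.

```python
import sqlite3


class SqliteConnector:
    """Stand-in for a relational source (hypothetical, not a Presto API)."""

    def __init__(self):
        self.db = sqlite3.connect(":memory:")
        self.db.execute("CREATE TABLE users (id INTEGER, name TEXT)")
        self.db.executemany(
            "INSERT INTO users VALUES (?, ?)", [(1, "ana"), (2, "bo")]
        )

    def scan(self, table):
        # Table names come from this toy script only, never from user input.
        cursor = self.db.execute(f"SELECT * FROM {table}")
        columns = [col[0] for col in cursor.description]
        return [dict(zip(columns, row)) for row in cursor]


class InMemoryConnector:
    """Stand-in for a file or object store holding event records."""

    def __init__(self):
        self.tables = {
            "clicks": [
                {"user_id": 1, "page": "/home"},
                {"user_id": 1, "page": "/buy"},
                {"user_id": 2, "page": "/home"},
            ]
        }

    def scan(self, table):
        return list(self.tables[table])


def hash_join(left, right, left_key, right_key):
    """Build a hash index on the left rows, then probe it with the right rows."""
    index = {}
    for row in left:
        index.setdefault(row[left_key], []).append(row)
    return [{**l, **r} for r in right for l in index.get(r[right_key], [])]


# Hypothetical catalog names mapping to the two stand-in sources.
CATALOGS = {"mysql": SqliteConnector(), "s3": InMemoryConnector()}


def clicks_per_user():
    """Join users from one source with clicks from another and count them."""
    users = CATALOGS["mysql"].scan("users")
    clicks = CATALOGS["s3"].scan("clicks")
    counts = {}
    for row in hash_join(users, clicks, "id", "user_id"):
        counts[row["name"]] = counts.get(row["name"], 0) + 1
    return counts


if __name__ == "__main__":
    print(clicks_per_user())
```

The design point is the shared interface: once every source can hand rows to the engine in a common shape, adding a new data store means writing one more connector rather than moving the data.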
One of the benefits of having a software subscription with Starburst – which employs the largest number of Presto contributors outside of Facebook – is that it allows a company to suggest new features for subsequent releases of the software. That’s how connectors for Oracle, SQL Server, and MongoDB databases came to be, Borgman says.