Making Hadoop Relatable Again
There has been much debate over the future of Hadoop in recent months. Should it work more like a cloud object store? Should it support GPUs and FPGAs, Docker or Kubernetes (or both)? Should compute and storage be separated in Hadoop? Is it even necessary anymore? The folks at Splice Machine have their own take: If you make Hadoop look more like a relational database, then people will do more with it.
“Everybody is struggling to figure out how to expose what’s been put on the data lake to the business,” Splice Machine founder and CEO Monte Zweben told Datanami at the recent Strata Data Conference. “Our opinion is that you can take infrastructure that people understand, like relational database management systems, and run them directly on the data lake.”
That’s essentially the message that Splice has been pushing since the peak of the Hadoop frenzy in the 2013-2015 timeframe, and it’s the same message that it’s pushing today. The big difference, according to Zweben, is the maturity level. Splice Machine’s open source technology that essentially turns Hadoop into a distributed ACID-compliant relational database is now ready for primetime. Wells Fargo is arguably its biggest paying customer and production use case, but it has dozens more across financial services, healthcare and other industries.
“We’re at a point of inflection as a company,” Zweben said. “We spent the last four years making the transactional database really work at scale, so being able to have petabyte-scale customers getting millisecond response times to queries for record lookups. That hasn’t really been done before at the SQL level with ACID compliance, and we finally proved it at that level for production data.”
Earlier this month the San Francisco company announced a new connector for Apache Spark that extends its Hadoop-resident RDBMS further into the Spark ecosystem. While Splice already utilized Spark (along with HBase) as an execution engine, the new connector brings Spark DataFrames into the Splice fold.
According to Splice, the connector brings two main benefits. First, it extends all the CRUD-like benefits of Splice’s database – including creating tables and inserting, updating, upserting, deleting, and querying data – to Spark DataFrames. Second, it makes data in Splice’s database available to Spark engines, such as MLlib, Spark Streaming, R, and Spark SQL.
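To make the list of operations concrete, here is a minimal sketch of create, insert, update, upsert, delete, and query against a relational table. It uses Python's built-in sqlite3 module purely as a stand-in; it does not show Splice Machine's connector or its API, and the accounts table and crud_demo function are hypothetical names chosen for illustration.

```python
import sqlite3


def crud_demo():
    """Run create/insert/update/upsert/delete/query on an in-memory table."""
    conn = sqlite3.connect(":memory:")  # stand-in database, not Splice Machine
    cur = conn.cursor()

    # Create
    cur.execute(
        "CREATE TABLE accounts (id INTEGER PRIMARY KEY, owner TEXT, balance REAL)"
    )

    # Insert
    cur.executemany(
        "INSERT INTO accounts VALUES (?, ?, ?)",
        [(1, "ann", 100.0), (2, "bob", 50.0)],
    )

    # Update an existing row in place
    cur.execute("UPDATE accounts SET balance = balance + 25 WHERE id = 2")

    # Upsert: replace the row if the key exists, insert it otherwise
    cur.executemany(
        "INSERT OR REPLACE INTO accounts VALUES (?, ?, ?)",
        [(1, "ann", 120.0), (3, "cy", 10.0)],
    )

    # Delete
    cur.execute("DELETE FROM accounts WHERE id = 2")
    conn.commit()

    # Query
    rows = cur.execute(
        "SELECT id, owner, balance FROM accounts ORDER BY id"
    ).fetchall()
    conn.close()
    return rows
```

Calling crud_demo() returns the rows that survive all five mutations, which is the kind of in-place change an append-only file store cannot express directly.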
Having a full database backing Spark will simplify the data movement activities for data scientists and engineers working in Spark, Zweben says. For starters, they no longer need to use JDBC or ODBC connections, which require data to be serialized and moved one record at a time. This should help with ETL and streaming analytics use cases.
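The contrast between moving one record at a time and handing over a whole batch can be sketched in a few lines. This is only an analogy using Python's sqlite3 module, not a measurement of JDBC, ODBC, or the Splice connector; the events table and the helper names are hypothetical.

```python
import sqlite3


def make_table():
    """Create an in-memory table to load into (stand-in for a real target)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (id INTEGER, payload TEXT)")
    return conn


def load_row_by_row(conn, rows):
    """Issue one INSERT call per record; returns the number of calls made."""
    calls = 0
    for row in rows:
        conn.execute("INSERT INTO events VALUES (?, ?)", row)
        calls += 1
    return calls


def load_in_bulk(conn, rows):
    """Hand the whole batch over in a single call; returns the number of calls."""
    conn.executemany("INSERT INTO events VALUES (?, ?)", rows)
    return 1
```

Both paths leave the same rows in the table. The difference is how many times the caller has to cross the boundary to the database, which is where per-record serialization costs tend to accumulate.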
The DataFrames API will also help with machine learning use cases, he says. After data scientists working in Spark build a predictive model using Python or R, they can use Splice Machine to extend that model to the data a business application has stored in the relational database.
Some people may wonder whether SQL has much to do with machine learning. Machine learning, after all, uses the power of statistics to automatically draw correlations between certain derived features hidden in data – huge amounts of data, ideally – while SQL is used to do arithmetic with numbers stored in tables.
Zweben has been down that road before. “Machine learning needs SQL,” he says. “And the reason is the power of any machine learning analytic is not algorithmic. It comes from the signal in the data. Getting the data in the right feature vector for the analytic is the secret behind good data science.”
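Zweben's point about getting data into the right feature vector can be illustrated with a small aggregation query. The sketch below again uses Python's sqlite3 module as a stand-in for any SQL engine; the purchases table, its columns, and the three features are hypothetical examples, not anything Splice describes.

```python
import sqlite3


def build_feature_vectors():
    """Aggregate raw purchase rows into one numeric feature vector per customer."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE purchases (customer_id INTEGER, amount REAL)")
    conn.executemany(
        "INSERT INTO purchases VALUES (?, ?)",
        [(1, 20.0), (1, 40.0), (2, 5.0), (2, 15.0), (2, 10.0)],
    )

    # SQL does the feature engineering: count, total, and maximum per customer.
    rows = conn.execute(
        """
        SELECT customer_id,
               COUNT(*)    AS n_purchases,
               SUM(amount) AS total_spend,
               MAX(amount) AS largest_purchase
        FROM purchases
        GROUP BY customer_id
        ORDER BY customer_id
        """
    ).fetchall()
    conn.close()

    # One list of floats per customer, ready to hand to a model.
    return {cid: [float(n), total, largest] for cid, n, total, largest in rows}
```

The learning algorithm never appears here. The work is in reshaping transactional rows into per-customer signals, which is the step Zweben credits with most of the value.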
When it’s pointed out that Spark already has a SQL implementation, Zweben agrees that Spark SQL is useful for some things, but argues that it’s not strong enough. “It doesn’t have enough SQL in it,” he says. “It does a lot, but it’s not mutable. There’s no updatable capability.”
There’s a certain power and elegance that comes from having a database at your command, as opposed to just a file system that can accept new files but doesn’t let you update existing files. As Hadoop has lost some of its luster, Splice is well-positioned to find out how much demand there is for a Hadoop-resident relational database.
“Somebody said the other day, ‘Why don’t you just describe yourself as making Hadoop updateable?’ It was an interesting statement,” Zweben said. “That’s what we are. Just like a database makes large-scale database tables updateable, delete-able, and query-able, that’s what we do to big data. And we do it in the same way as you did it on a relational database management system.”