New Ropes for Scaling the SQL Wall
Surveying the database landscape requires more than one pair of binoculars. Two camps with established roots in enterprise soil (SQL and, increasingly, NoSQL) have expanded and encroached on one another’s territory, in part because they are still somewhat reliant on one another to survive. While one boasts features honed over years of development, the other flies the bright flag of scalability.
In the war being waged against big data, however, there might not be a clear victor. In fact, some have argued that companies need both camps to scale into a new era and colonize the uncivilized masses of teeming, wild data, much of it galloping into real-time application engines.
One such company keen to make such an argument is Splice Machine, which is fresh off a funding round and ready to ride into the sunset with what it calls “the first SQL-compliant database” designed specifically for big data applications.
The rollout of the company’s Splice SQL Engine was timed to coincide with demand for a massively scalable database that chucks the “compromises” of NoSQL or traditional relational databases. With an HBase backbone supporting big data access without requiring a clunky rewrite of existing SQL applications, the company feels it can address the big struggles enterprises face with their data: scalability, inflexible schemas, high availability, transaction capabilities and trusty SQL optimizations.
We’ve heard talk about these database limitations before, but in this case, it’s worthwhile to see the bigger picture from the standpoint of a company that teeters on the edge of the SQL/NoSQL bridge. In a recent interview, the Splice Machine CEO and co-founder, Monte Zweben, put forth his own definition of the big data applications his company is targeting. The former NASA Ames AI Deputy Branch Chief and software startup soldier claims that his definition is more specific than general “big data” definitions because it is based on a very particular set of needs from users: many companies already have extensive investments in SQL (everything from existing applications to trained personnel) but are hitting the SQL wall on the data volume and complexity front.
“The NoSQL community threw out the baby with the bath water. They got it right with flexible schemas and distributed, auto-sharded architectures, but it was a mistake to discard SQL,” said Zweben. “The Splice SQL Engine enables companies to get the cost-effective scalability, flexibility and availability their Big Data, mobile and web applications require – while capitalizing on the prevalence of the proven SQL tools and experts that are ubiquitous in the industry.”
The Splice Machine definition of a big data application is worth paying attention to since it breaks from the tired old framework we’re all used to hearing in the big data conversation. The company points to the new breed of enterprise apps that require the sharding or distribution of data across a commodity cluster. Zweben says these applications require the ability to perform all CRUD (create, read, update and delete) operations while scaling from a few terabytes to the petabyte level and beyond. Just as important, they need to do so without losing the all-important features of a time-tested SQL approach.
With that in mind, their focus on the term “SQL-compliant” has a bit more context. They refer to the database features that developers expect from traditional relational databases, including real-time updates, full SQL support, secondary indices, as well as transactional and join capabilities. The goal is to help developers avoid having to develop these features (often sub-optimally) in their own application code while taking advantage of the benefits SQL provides.
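To make that feature list concrete, here is a minimal sketch of a secondary index, a transaction, an in-place update and a join. It assumes Python’s built-in SQLite module purely as a stand-in for a relational engine; it is not Splice Machine’s API, and the table and column names are invented for illustration.

```python
import sqlite3

# SQLite is only a stand-in relational engine here; it is NOT Splice Machine.
# The tables and columns are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, region TEXT)")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)")

# Secondary index: an index on a column other than the primary key.
conn.execute("CREATE INDEX idx_customers_region ON customers (region)")

# Transaction: both inserts commit together when the block exits cleanly.
with conn:
    conn.execute("INSERT INTO customers VALUES (1, 'Ada', 'west')")
    conn.execute("INSERT INTO orders VALUES (10, 1, 25.0)")

# A failing transaction is rolled back as a whole, so order 11 never appears.
try:
    with conn:
        conn.execute("INSERT INTO orders VALUES (11, 1, 40.0)")
        conn.execute("INSERT INTO customers VALUES (1, 'Dup', 'east')")  # duplicate key
except sqlite3.IntegrityError:
    pass

# Single-row update, with no batch reload of the table.
with conn:
    conn.execute("UPDATE orders SET total = 30.0 WHERE id = 10")

# Join across the two tables, filtered on the indexed column.
rows = conn.execute(
    "SELECT c.name, o.total FROM customers c "
    "JOIN orders o ON o.customer_id = c.id WHERE c.region = 'west'"
).fetchall()
order_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
```

Each of these is a single statement to the database; without them, the equivalent bookkeeping has to live in application code.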
For instance, Zweben says that when it comes to real-time updates, analytic databases that require a re-run of their batch ETL to make a single update aren’t appropriate for most real-time applications. Similarly, many users need the ability to create secondary indices on any column in order to run flexible and high performance queries.
While the company is seeing that terabyte-scale, read-only analytical applications are more prevalent, the folks it is speaking with are looking at incredible data growth and are dreading an impending “forklift upgrade” to keep pace, particularly for performance-hungry real-time applications.
For the users they’re working with who are up against a SQL wall, the company argues, there aren’t any truly workable solutions on the level of what it has been cooking up. Further, they claim that nothing out there is really SQL-compliant while still addressing the brick wall. For instance, they claim that traditional RDBMSs are obviously SQL-compliant, but they often fail to scale past a terabyte without resorting to manual sharding or specialized hardware. The “big data” databases out there can indeed move past the petabyte barrier, but Zweben says that, like NoSQL databases, they often have poor SQL compliance because there are large gaps on the transactional, real-time updating and full SQL language support sides. And when it comes to the NoSQL camp, well, the name bars any likeness to SQL and further, says Zweben, these do not, contrary to popular opinion, have the ability to scale to the petabyte level.
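For readers who have not had to do it, manual sharding roughly means the application decides which database holds each row. The following is a hedged sketch, assuming hash-based routing over in-memory dictionaries that stand in for separate database servers; the `ShardedStore` class and its method names are hypothetical and do not come from any product mentioned here.

```python
import hashlib


class ShardedStore:
    """Hypothetical application-side sharding layer (illustration only)."""

    def __init__(self, shard_count):
        # Each dict stands in for a separate database server.
        self.shards = [dict() for _ in range(shard_count)]

    def _shard_for(self, key):
        # Stable hash so the same key always routes to the same shard.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % len(self.shards)

    def put(self, key, row):
        self.shards[self._shard_for(key)][key] = row

    def get(self, key):
        return self.shards[self._shard_for(key)].get(key)

    def scan(self, predicate):
        # Queries that are not keyed must fan out to every shard.
        return [row for shard in self.shards for row in shard.values() if predicate(row)]


store = ShardedStore(shard_count=4)
for i in range(100):
    store.put(f"customer-{i}", {"region": "west" if i % 2 == 0 else "east"})

west_rows = store.scan(lambda row: row["region"] == "west")
```

Note that changing `shard_count` in this sketch would reroute most keys, which is one reason hand-rolled sharding tends to be painful to grow.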
“For instance,” argued Zweben, “consider Cassandra, probably the most scalable NoSQL database. It has limited SQL compliance, no joins, no transactions, and weak (eventual) consistency.” He also notes that the largest known Cassandra cluster has over 300 TB of data in over 400 machines, a fact that Apache shares on its Cassandra page.
In the end, however, Zweben says it’s not a simple matter of customers choosing SQL over NoSQL. The brains behind Splice Machine says that it’s more like an issue of customers wanting SQL and NoSQL. “Customers mostly want NoSQL for its scalability (and sometimes schema flexibility). However, since customers have huge investments in SQL already—existing applications, BI tools, SQL analysts and SQL developers—they also want the [SQL] capabilities like joins, strong consistency and transactions that are invaluable and very expensive and risky for each developer to implement individually.”
With this in mind, they’re making the case that companies want to tap into the scalability of NoSQL but with the familiarity and reliability of SQL. “Since we’re built on top of a NoSQL database, we’re bringing the best of both worlds,” adds Zweben.
The team is trying to tap into those two worlds in enterprise IT at the moment—Hadoop (and the companion database pieces) and trusty SQL. They claim they are drafting off the market momentum of Hadoop with their platform’s HBase foundations. This makes decent sense, since there are a number of companies that have climbed aboard the Hadoop express and have an existing HBase deployment but want to find a captain-like interface that speaks SQL.
Further, this blend of the two approaches to big data could hold appeal for companies that are tapping into NoSQL databases and finding themselves re-implementing what Zweben calls a “poor man’s version” of features like transactions and joins in each application. In fact, the company says it has been encountering this half-hearted attempt far more than it would have expected.
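As an illustration of what such a re-implementation can look like, here is a minimal sketch of a join written in application code over a key-value store. The dictionaries stand in for two NoSQL tables, and every name and record in it is made up for the example.

```python
# Two plain dicts stand in for key-value tables; all data is hypothetical.
users = {
    "u1": {"name": "Ada"},
    "u2": {"name": "Lin"},
}
orders = {
    "o1": {"user_id": "u1", "total": 25.0},
    "o2": {"user_id": "u1", "total": 5.0},
    "o3": {"user_id": "u9", "total": 1.0},  # refers to a user that does not exist
}


def join_orders_to_users(users, orders):
    """Hand-written inner join: one key lookup per order."""
    results = []
    for order_id, order in sorted(orders.items()):
        user = users.get(order["user_id"])
        if user is None:
            continue  # inner-join semantics: skip orders with no matching user
        results.append((user["name"], order_id, order["total"]))
    return results


joined = join_orders_to_users(users, orders)
```

In SQL the same result is a single `JOIN` clause, and the database, not each application, is responsible for executing it efficiently and consistently.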
Other than being rid of manual sharding for RDBMS users, there is another piece the company says needs to be considered, at least from the industry perspective. It pointed to a number of applications, especially real-time personalization in web commerce, personalized treatments through electronic medical records, and smart meter applications, that require the balance between scalability and functionality it is hoping to provide.