Deep Attempts to Recreate the Single Platform, General Purpose Database
A Portsmouth, NH startup named Deep says that they’re throwing their hat into the ring with the lofty aim of re-architecting the single platform, general purpose database from the ground up.
Says Mike Skubisz, VP of Product Management at Deep, “We essentially threw out a lot of the contemporary best practices and said, let’s take a clean sheet of paper and build a general purpose database that can provide a single platform and do all those things that graph databases, and in-memory databases, and columnar databases – all these specialty things exist because the general purpose databases couldn’t do it.”
The platform they’re attempting is called DeepDB, which they believe can be a unifying platform to bring together several of the different special purpose use cases back into a single database. In many ways, it can be seen as an attempt to reinvent the wheel, but Skubisz insists that their approach has garnered benchmark results worthy of consideration.
He says that in order to build the DeepDB platform, they attacked some of the basic problems that traditional relational databases had. “One is disk drives,” he told us. “Though they’ve gotten a lot faster, they are inherently slow, and that’s why we’ve seen things like in-memory databases become popular – or a lot of people are throwing money at the problem and going with solid state drives to try and speed up the disk problem.”
Among the renovations, they targeted Rudolph Bayer’s B-tree data structure. “The problem is that the nature of the B-tree is that as the data set gets extremely large, it not only becomes unwieldy, but mathematically starts to hit some walls,” he explains.
A chief problem that Skubisz says B-trees have is that as these structures are stored on disk, the act of locating the data is inefficient. “The mechanical aspect of the seek operation is extremely slow and extremely expensive,” he explains. “You might have a disk array that is capable of maybe 10 or 20 gigabytes of throughput as the theoretical bandwidth, but it’s only getting a fraction of that because it’s so busy moving the heads around trying to find data in these tree structures.”
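The seek-bound behavior Skubisz describes is easy to sketch with back-of-the-envelope arithmetic. The figures below (seek time, page size, fanout) are our own illustrative assumptions, not Deep's numbers, but they show why a point lookup through an on-disk tree sees only a sliver of the drive's sequential bandwidth:

```python
# Rough sketch: why B-tree point lookups on a mechanical disk waste most
# of the theoretical bandwidth. All constants are illustrative assumptions.
import math

SEEK_TIME_S = 0.008          # ~8 ms average seek for a spinning drive
NODE_SIZE_BYTES = 16 * 1024  # one 16 KB tree page fetched per seek
FANOUT = 100                 # assumed keys per interior node

def btree_lookup_cost(num_rows):
    """Estimate seeks and effective throughput for one point lookup."""
    hops = math.ceil(math.log(num_rows, FANOUT))  # tree height ~ log_fanout(N)
    seconds = hops * SEEK_TIME_S                  # each hop pays a full seek
    bytes_read = hops * NODE_SIZE_BYTES
    return hops, bytes_read / seconds             # effective bytes per second

hops, throughput = btree_lookup_cost(10**9)       # a billion-row table
print(hops, throughput)
```

Under these assumptions a billion-row lookup costs 5 seeks and moves data at roughly 2 MB/s – a tiny fraction of what the same drive can stream sequentially, which is the gap Deep is targeting.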
Deep’s approach to this was to invent a new storage mechanism, says Skubisz – a new data structure that he says allows streaming to the disk. “It allows our on-disk storage algorithms to operate in an ‘append only’ type arrangement – like a log file,” he explains. “It’s much more involved than that, but essentially, from a disk behavior point of view, it looks like a log file, so you can get virtually wire speed throughput off of the disk because we’ve eliminated all of the seeking that happens.”
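The append-only idea can be illustrated with a minimal sketch. This is our own toy model of a log-structured store, not DeepDB internals: writes always land at the tail of a log (sequential I/O, no seeks), and updates simply shadow older entries:

```python
# Minimal sketch of an append-only, log-structured store (an illustration
# of the general technique, not DeepDB's actual data structure).

class AppendOnlyStore:
    def __init__(self):
        self.log = bytearray()   # stands in for the on-disk log file
        self.offsets = {}        # key -> (offset, length) of the latest value

    def put(self, key, value: bytes):
        offset = len(self.log)   # always write at the tail: sequential I/O
        self.log.extend(value)
        self.offsets[key] = (offset, len(value))  # new writes shadow old ones

    def get(self, key):
        offset, length = self.offsets[key]
        return bytes(self.log[offset:offset + length])

store = AppendOnlyStore()
store.put("a", b"v1")
store.put("a", b"v2")            # an update appends; old bytes stay in the log
print(store.get("a"))            # b"v2"
```

Real log-structured systems add compaction to reclaim the shadowed entries; the point here is only that the write path never moves backward.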
Another innovation Deep has undertaken, says Skubisz, is in indexing, where they’ve employed a technique they call “summary indexing,” which he says speeds things up considerably. “The summary indexing name is fairly self-descriptive in that it allows us to understand or represent very large sets of data by using a mechanism of summarization, such that we can handle datasets that are in the yottabytes worth of rows in a table. [Using this technique] we are able to navigate to any piece of data we’re looking for within a matter of 6 hops, where traditional technology would have taken 40 or 50 hops to do the same thing.”
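The hop counts in the quote can be sanity-checked with simple arithmetic: if each index hop narrows the search by a fanout f, reaching one row among N takes about log_f(N) hops. The sketch below is our own calculation, not Deep's, and the fanouts are assumptions chosen to match the quoted figures:

```python
# Our sanity-check arithmetic: hops needed to isolate one row among
# num_rows when each hop narrows the candidates by `fanout`.

def hops(num_rows, fanout):
    h = 0
    remaining = num_rows
    while remaining > 1:
        remaining = -(-remaining // fanout)  # ceiling division: one hop
        h += 1
    return h

N = 10**24                # on the order of "yottabytes worth of rows"
print(hops(N, 2))         # a binary-style narrowing: 80 hops
print(hops(N, 10_000))    # a very wide summary level per hop: 6 hops
```

Covering 10^24 rows in 6 hops implies each summary level fans out by roughly 10,000 – which is the kind of width a compact summarization scheme would need.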
He says that in addition to this, DeepDB decouples the location of the data from the values, keeping the values outside of the index structures. Whereas traditional relational database structures contain the data they represent, Skubisz says the DeepDB summary index doesn’t drag around all of the values.
“When you want to get a record out of a database, we can go very quickly through our index structure, find the location of what we want, and then read that value directly off disk,” he explains. “It becomes a hyper-efficient mechanism for not only getting data on and off disk, but it also allows us to represent very large data sets in a relatively small amount of memory.”
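The decoupling Skubisz describes can be sketched as an index that holds only small, fixed-size (key, offset) entries in memory, with the values themselves living in a file on disk. This is our own illustration under those assumptions, not DeepDB code:

```python
# Sketch of a value-free index: memory holds only key -> (offset, length),
# so index size is independent of how large the stored values are.
import os
import tempfile

class OffsetIndex:
    def __init__(self, path):
        self.path = path
        self.index = {}  # key -> (offset, length); no values carried here

    def write(self, key, value: bytes):
        with open(self.path, "ab") as f:
            offset = f.tell()
            f.write(value)
        self.index[key] = (offset, len(value))

    def read(self, key):
        offset, length = self.index[key]
        with open(self.path, "rb") as f:
            f.seek(offset)       # one targeted read, not a tree walk
            return f.read(length)

path = os.path.join(tempfile.mkdtemp(), "values.log")
idx = OffsetIndex(path)
idx.write("user:42", b"a large serialized record ...")
print(idx.read("user:42"))
```

Because each in-memory entry is a few dozen bytes regardless of the record's size, a structure like this can describe a very large dataset in a comparatively small amount of RAM, which is the property the quote emphasizes.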
Adding to their indexing remodeling, Skubisz says that DeepDB supports constant-time indexing. He explains that in a traditional database environment, index management (adding indexes to your database) is computationally very expensive – a ‘damned if you do, damned if you don’t’ proposition. “In most databases, you try to avoid indexing and only put in those indexes that are absolutely necessary,” he says. “The downside to that is when you do need to find something that you haven’t indexed, it becomes extremely expensive.”
DeepDB, says Skubisz, allows for a much heavier amount of indexing due to the lightweight summarization techniques that they’ve employed. “It allows us to maintain all of our indexing in parallel so you don’t have any lags between when the data gets stored, and when it’s indexable or findable by the application space.”
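The "no lag between storing and findable" property can be illustrated with a toy table that updates every column index inline with the insert, rather than in a deferred build step. The class and column names below are our own illustration, not Deep's API:

```python
# Sketch of maintaining all indexes in lockstep with each write, so a row
# is findable through any indexed column the moment insert() returns.
from collections import defaultdict

class Table:
    def __init__(self, indexed_columns):
        self.rows = []
        # one lightweight index per column, kept current on every insert
        self.indexes = {col: defaultdict(list) for col in indexed_columns}

    def insert(self, row: dict):
        row_id = len(self.rows)
        self.rows.append(row)
        for col, index in self.indexes.items():  # update every index inline
            index[row[col]].append(row_id)
        return row_id

    def find(self, col, value):
        return [self.rows[i] for i in self.indexes[col][value]]

t = Table(indexed_columns=["city", "age"])
t.insert({"name": "Ada", "city": "Portsmouth", "age": 36})
print(t.find("city", "Portsmouth"))  # visible immediately, no build step
```

The tradeoff, as the article notes, is that each write pays for every index it touches – which is why keeping the per-index cost lightweight matters if you want to index heavily.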
One of the final areas Skubisz spoke with us about was concurrency, where he says DeepDB has been designed from the ground up to exploit all of the system resources available to it.
The end result, says Skubisz, is a faster, general purpose database that can handle the demands of the big data future. He says the company has architected the platform to fit into existing database frameworks, with the first connector, for the MySQL environment, coming out this summer.
“This allows us to basically go into any MySQL customer and install our technology,” he explains. “From a MySQL point of view, it looks like a storage engine plugin, so it’s a natural act for MySQL to see it as another storage engine that it can use.”
Deep has some very steep challenges to face as it tries to gain mind share and market share for its platform. One of those challenges, Skubisz admits, is the on-ramp transition from small to big data. “It’s not a tradeoff, per se, but if you used our database in a small data scenario, you might not see any performance improvement – or in some edge cases, you might see that we’re slower than traditional database technologies,” he explains. “That tradeoff is really to optimize for large data sets – not only large in terms of size, [but also] in increasing complexity.”
Skubisz says that while the industry is largely focused on the “three V’s” of big data (variety, velocity, and volume), Deep adds three C’s: complexity, computational demand on the data, and constraints where time is concerned.
“We had one of our early customers with an application where they were operating on a relatively small dataset – just a couple of gigs,” he explained. “However, they had extremely complicated operations they were doing in terms of data summarization and analysis.” Using Microsoft SQL Server, he explains, it was taking them roughly ten and a half hours to run analysis on the dataset; without changing any of their application code or database schema – just by switching to the DeepDB engine – they were able to drop the process to just over 16 minutes. “We don’t consider a couple of gigabytes big from a size point of view, but the amount of manipulation they were doing to the data pushed it, in our view, to what we would consider a big data application case.”
He says that in another example, they had a customer handling multi-terabytes worth of data, and just loading the data into the database was taking them north of 24 hours. “We can do that in something like fifty minutes.”
While the DeepDB approach raises more questions than it answers, it’s interesting to see that startups are still cropping up trying to be the “Holy Grail” solution in the big data space. Time will tell whether Deep can go “deep” with its technology, or drown as another also-ran in the sea of available technologies.