DeepSQL Kicks Evolutionary Genetic Research Into High Gear
The Ruđer Bošković Institute (RBI) in Croatia blasted through a bottleneck in its evolutionary genetic research by implementing a parallelized storage engine for MySQL from Deep Information Sciences that serves genomic data to its analytic computer cluster.
RBI is Croatia’s leading scientific institute, with 850 scientists conducting research into various natural and biomedical fields. In particular, its Department of Molecular Biology has created novel ways of looking into how human genes evolved by comparing each human gene to all the genes in the world’s organisms.
This research requires good stewardship of data. Prior to implementing the DeepSQL storage engine from DeepIS, RBI relied on a MySQL database from Oracle equipped with the InnoDB storage engine from Percona to ingest, prep, and serve terabytes’ worth of fresh genomic data to a 100-node cluster that does the heavy analytic lifting. The MySQL database lived on a single server equipped with 200GB of RAM and 2TB of storage.
This setup worked fine when the genomic data set was relatively small. But as the research project ramped up, the researchers found that the database was quickly becoming a bottleneck.
“The database is not growing slowly–it’s growing faster and faster because the cost of sequencing goes down,” explains Dr. Martin Sebastijan Šestak, a post-doctoral researcher at RBI’s Department of Molecular Biology, Laboratory of Evolutionary Genetics. “That’s actually a big problem for us. The number of genomes is constantly increasing, and we want to keep up to date with that information, which means we need to constantly update our database.”
Eventually, it took Šestak’s team three to four days to load fresh genomic data from various public sources into the single-core MySQL database, which currently has data measuring in the hundreds of millions of rows. Then it took another one to two days to run the queries on the high-performance computing cluster. While the size of the data wasn’t particularly huge, the need to continually join 50GB tables into the database was becoming a real bottleneck.
“As our database grew to 250GB with joins, larger than our 200GB RAM server, InnoDB got slower and slower,” he says. “Everything slowed down to a crawl. It was impossible to get anything done on schedule.”
Šestak looked at different database technologies, including Percona’s TokuDB, an open source storage engine that plugs into MySQL and essentially replaces InnoDB. While performance improved a bit with TokuDB due to its use of fractal tree indexing, it was still not as fast as Šestak desired.
While attending a Percona conference in Santa Clara this spring, Šestak heard of another MySQL storage engine called DeepSQL. “I saw that there were some new storage engines that I didn’t hear about before,” he tells Datanami, “so I decided to download it and benchmark it against other solutions.”
DeepSQL, if you’re not familiar with it, replaces the B-Tree indexing used in most database storage engines with something that DeepIS calls Continuously Adaptive Sequential Summarization of Information, or CASSI. Instead of continually writing data to disk, CASSI uses machine learning algorithms to better predict the optimal moment to write data to disk, based on the particular configuration and capability of a computer. It also implements parallelism to boost performance.
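The general idea of deferring disk writes can be illustrated with a toy write buffer. This is a minimal sketch of plain write batching, not DeepIS's CASSI algorithm: the `BatchedWriter` class, the `simulate` helper, and the fixed `flush_threshold` are all hypothetical names invented for illustration, and a system like the one described would presumably choose its flush moments adaptively rather than by a fixed count.

```python
class BatchedWriter:
    """Toy write buffer: collects rows in memory and flushes them in batches.

    Illustrative only; this is generic write batching, not CASSI.
    """

    def __init__(self, flush_threshold):
        self.flush_threshold = flush_threshold  # hypothetical tuning knob
        self.pending = []       # rows waiting in memory
        self.flush_count = 0    # number of simulated disk writes
        self.rows_written = 0   # total rows that reached "disk"

    def write(self, row):
        """Buffer a row and flush once the threshold is reached."""
        self.pending.append(row)
        if len(self.pending) >= self.flush_threshold:
            self.flush()

    def flush(self):
        """Simulate one disk write covering every pending row."""
        if not self.pending:
            return
        self.rows_written += len(self.pending)
        self.pending.clear()
        self.flush_count += 1


def simulate(num_rows, flush_threshold):
    """Return how many simulated disk writes are needed for num_rows rows."""
    writer = BatchedWriter(flush_threshold)
    for i in range(num_rows):
        writer.write(i)
    writer.flush()  # write out any remainder
    return writer.flush_count


if __name__ == "__main__":
    eager = simulate(1000, flush_threshold=1)      # one disk write per row
    batched = simulate(1000, flush_threshold=100)  # one disk write per 100 rows
    print(f"eager flushes: {eager}, batched flushes: {batched}")
```

In this toy model, the same 1,000 rows reach disk either way; only the number of separate write operations changes, which is the cost that deferring writes aims to reduce.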
These approaches can erase bottlenecks in an analytics pipeline (or at least push them elsewhere). When DeepIS launched the technology at the Percona conference earlier this year, it claimed a MySQL database equipped with the DeepSQL storage engine could be up to 64 times faster than a highly tuned instance of InnoDB. This “hyper-indexing” capability of a DeepSQL database makes it seem like it’s running on SSDs, even if it’s on plain old HDDs, the company claims.
What’s more, DeepSQL delivers this performance boost without requiring underlying changes to the database or new APIs for the application, since it plugs into the MySQL architecture.
Earlier this year, RBI switched to DeepSQL Community Edition (free for organizations with less than $1 million in annual revenues) to power the genomic database. The impact on performance was dramatic and instantaneous.
According to RBI, the periodic uploads of fresh data take just one day, instead of the previous four. Data load times are three times faster than under TokuDB, while queries run five times faster. DeepSQL also shrank RBI’s storage footprint by delivering 40 percent compression.
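As a rough back-of-the-envelope check on these figures, the arithmetic can be sketched as follows. The helper names are hypothetical, and two assumptions go beyond what RBI reported: that "40 percent compression" means the stored data shrinks by 40 percent, and that it applies to the 250GB database size Šestak quoted earlier.

```python
def speedup(before, after):
    """Ratio of the old duration to the new one (same units for both)."""
    return before / after


def size_after_reduction(size_gb, reduction_percent):
    """Size left after shrinking size_gb by reduction_percent percent."""
    return size_gb * (100 - reduction_percent) / 100


if __name__ == "__main__":
    # Upload time: four days before, one day after (figures from RBI).
    load_speedup = speedup(4, 1)

    # ASSUMPTION: the 40 percent figure is a size reduction applied to the
    # 250GB database mentioned earlier; the article does not say so explicitly.
    compressed_gb = size_after_reduction(250, 40)

    print(f"upload speedup: {load_speedup:.0f}x")
    print(f"estimated footprint: {compressed_gb:.0f}GB")
```

Under those assumptions, the compressed footprint would come in below the server's 200GB of RAM, though the article does not state whether that is what happened in practice.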
This has freed Šestak and his colleagues to concentrate on their research rather than fiddling with computers. (They have developed a geologic-like method of gradually uncovering genetic “layers” in the genomic codebase to track evolution.) “When the size grows even bigger, I’ll still be able to analyze that without moving it to Hadoop or some other technology that I need to learn and administer,” he says.
Speed and scalability are critical when you’re dealing with the type of data that RBI is, says Deep Information Sciences’ Chief Strategy Officer Chad Jones. “Tuning the application and tweaking my MySQL configurations only get you so far–they consume a lot of time for not a lot of reward,” he says. “We’re thrilled that RBI is using DeepSQL to supercharge its research into biological evolution and that, even without a DBA, they’re able to achieve orders of magnitude better performance and scale from their MySQL environment.”
The parallelization of the DeepSQL storage engine gives RBI lots of headroom to grow its analytic pipelines. “They [Oracle, the owners of MySQL] really should implement parallel processing,” Šestak says. “But it’s not really a priority for them. But it is for us, and that’s why we fit DeepSQL into our pipeline.”