HBase Rival Steps Out of the Shadows
This week Hypertable, the company behind the open source, scalable database of the same name, which was modeled on Google’s own Bigtable database, announced a series of new offerings for big data applications that it says outperform some commonplace alternatives, including HBase.
The database project, which originated with CEO Doug Judd, who was working with the Inktomi search division in the late 1990s, has seen some adoption across a number of industries, including life sciences, financial services and engineering. The pitch Judd puts out to companies is that they can get maximum efficiency and performance out of the database’s inherent scalability with a supported (and training program-fed) open source product.
While it’s always best to tout the results of external benchmarking efforts, the company says its own tests using Hypertable against HBase within a real-world workload example revealed rather sharp differences—not necessarily in terms of general capacity, but rather in terms of overall utilization of available resources.
They claim that these results “illustrate how Hypertable can deliver equivalent database capacity on a fraction of the hardware, translating into less equipment, less power consumption and less datacenter real estate, resulting in lower overall costs.” In addition to the claims of efficiency gains, the company’s offering, which was composed in the key of C++ almost exclusively, plays well with Hadoop DFS, GlusterFS and the less-used Kosmos FS, which gives it some flexibility of use.
To get to the heart of the performance-to-footprint claims, we talked with Hypertable CEO Doug Judd, both to see how the database leverages the Bigtable architecture to achieve the company’s benchmarking results and to learn where it will fit within an already very noisy big data ecosystem.
At a high level, what are the differences between HBase and what you’re offering?
From a design standpoint, HBase and Hypertable are very similar. The primary difference at a high level is that Hypertable offers much better performance, primarily due to choice of implementation language. This means that Hypertable can deliver the same database capacity on less hardware (fewer servers), consuming less power, and less datacenter real estate, which translates to lower overall cost.
Can you provide more details on the real-world workload-driven tests you talk about that allowed you to show your advantage over HBase?
Many common big data applications manage a massive amount of very small items, for example, URLs (< 100 bytes), sensor readings (10 – 100 bytes), genome sequence reads (~100 bytes). By including tests that exercise the systems at these smaller value sizes, the test better models these common use cases. Also, without getting too technical, the Zipfian random read tests model typical human access patterns, where you have a relatively small number of popular items and a larger number of less popular items. In both the small value-size and Zipfian read tests, Hypertable demonstrated much better performance than HBase.
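To make the workload described above concrete, here is a minimal Python sketch of a Zipfian read pattern over small values. It is not the harness Hypertable used: the key count, the 100-byte value size, the exponent `s=1.0`, and every function name are assumptions chosen for illustration, and the "store" is just an in-memory dictionary.

```python
import random
from collections import Counter


def zipf_weights(n_keys, s=1.0):
    """Unnormalized Zipf weights: the key at rank r gets weight 1 / r**s."""
    return [1.0 / (rank ** s) for rank in range(1, n_keys + 1)]


def make_small_values(n_keys, value_size=100, seed=0):
    """Build a toy key/value store whose values are small random byte strings."""
    rng = random.Random(seed)
    return {
        f"key{i:06d}": bytes(rng.getrandbits(8) for _ in range(value_size))
        for i in range(n_keys)
    }


def zipfian_reads(keys, n_reads, s=1.0, seed=0):
    """Sample a read sequence in which low-ranked (popular) keys dominate."""
    rng = random.Random(seed)
    return rng.choices(keys, weights=zipf_weights(len(keys), s), k=n_reads)


# Hypothetical sizes: 1,000 keys, 100-byte values, 10,000 reads.
store = make_small_values(1000, value_size=100)
keys = sorted(store)  # position in this list doubles as popularity rank
reads = zipfian_reads(keys, 10000)

counts = Counter(reads)
top10_share = sum(counts[k] for k in keys[:10]) / len(reads)
print(f"Top 10 of {len(keys)} keys received {top10_share:.0%} of reads")
```

With these assumed parameters, the ten most popular keys out of a thousand draw a disproportionate share of the reads, which is the "few popular items, many unpopular ones" pattern Judd describes.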
Given that there are already a number of products firmly entrenched in this space, how do your offerings fit into the broader ecosystem, both in terms of their status as supported open source and as competition in what might be considered an already very crowded market?
The Hypertable project was one of the first projects in the scalable big data space. In fact, the HBase project and the Hypertable project started out in life as the same project back in early 2007. Hypertable Inc. currently has six customers, acquired with less than half a million dollars of funding. Cloudera has raised $76 million to date and has acquired 21 customers according to their website. I would argue that we are healthier than Cloudera and are one of the companies entrenched in the market, crowding out the others.
Do you have a higher profile use case for Hypertable that we could discuss?
One interesting use case is the work that we’ve done with the UCSF-Abbott Viral Diagnostics and Discovery Center to build a deep genome sequencing system for novel virus discovery. The capacity to generate genome sequence data has grown exponentially over the past few years and the traditional tools for storing and accessing this data can’t keep up with the volume.
Recognizing this trend, the VDDC teamed up with Hypertable Inc. to deploy Hypertable, a next-generation, massively scalable NoSQL database, designed specifically to handle big data problems of this magnitude in a cost-effective manner. The solution consists of Hypertable running on Hadoop, deployed with the Mesos resource manager on a ten-node cluster of commodity class servers.
Digital DNA read data is loaded into a table in Hypertable, a series of parallel computations is run as Hadoop MapReduce jobs, and the reads that match known genome databases are annotated in the table in Hypertable. When the processing is finished, the DNA reads that have no annotations are considered novel.
UCSF has a policy of not endorsing any products, so unfortunately we can’t get a quote from anyone at UCSF about this project. However, take a look at “A Genome Sequence Analysis System Built with Hypertable” for slides from a talk we presented at NoSQL Now! 2011.
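The load, annotate, and filter flow described in the answer above can be sketched in a few lines of Python. This is only an illustration under loud assumptions: a plain dictionary stands in for the Hypertable table, a thread pool stands in for the Hadoop MapReduce jobs, exact substring matching stands in for real sequence alignment, and all names and sequences are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def annotate_read(sequence, references):
    """Return the names of reference genomes that contain the read.

    Exact substring matching is a stand-in for real sequence alignment.
    """
    return [name for name, genome in references.items() if sequence in genome]


def find_novel_reads(table, references, workers=4):
    """Annotate every read in parallel, then return the unannotated read IDs."""
    rows = list(table.items())  # snapshot of the rows before any updates
    with ThreadPoolExecutor(max_workers=workers) as pool:
        hits = pool.map(
            lambda row: annotate_read(row[1]["sequence"], references), rows
        )
        for (read_id, _), matched in zip(rows, hits):
            table[read_id]["annotations"] = matched
    # Reads with no annotations matched nothing known, so treat them as novel.
    return sorted(rid for rid, row in table.items() if not row["annotations"])


# Hypothetical toy data; real reads and reference genomes are far larger.
references = {
    "virus_A": "ACGTACGTGGA",
    "human_fragment": "TTGACCATGCA",
}
table = {
    "read1": {"sequence": "ACGTACG", "annotations": []},
    "read2": {"sequence": "CCATG", "annotations": []},
    "read3": {"sequence": "GGGGCCCC", "annotations": []},
}

novel = find_novel_reads(table, references)
print("Novel reads:", novel)
```

Writing the annotations back into the same table mirrors the approach in the interview, where the absence of an annotation is itself the signal that a read is novel.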