October 24, 2012

MapR Traces New Routes for HBase

Nicole Hemsoth

It’s showdown mode for the Hadoop distro vendors who have gathered in New York City for this year’s Strata Conference and Hadoop World 2012 event. Among that list of companies vying to spin up the best platform for a growing community of Hadoop users is MapR, which hosted an event just a few blocks from the main action today. The event, hosted in conjunction with Google, was centered around their newest M7 platform refresh.

Each of the vendors has tackled a specific component of the Hadoop puzzle in an effort to appeal to the widest base of users, making the task of comparing the various Hadoop distributions even trickier.

Of course, if you ask MapR, there’s no trick about it; they leaped at the opportunity to illustrate what separates their enterprise-grade services from the others. Among other things, they claim their distinguishing features are NFS access, a clever approach to fault tolerance and reliability, and some enhancements to boost performance and scalability.

At the root of some of their recent work for M7 on the speed, scale and access fronts is HBase, which was where they put the bulk of their development efforts in advance of today’s announcement of their boosted platform. The last enterprise distribution, M5, addressed reliability and access shortfalls, but the newest release targets performance more distinctly.

According to MapR, HBase is a natural target area since it’s becoming a stable part of many production environments. The company says that around 40% of Hadoop users opt for HBase as their touchstone non-relational distributed database, in part because of the simple fact that it runs natively on top of HDFS in a Hadoop cluster environment. It’s not that there aren’t other NoSQL options out there, however, said MapR’s Jack Norris.

Despite its longevity and decent adoption level, HBase still has the reputation of an immature database approach, but MapR thinks they’ve pinned down a few tricks on the recovery side, including providing robust snapshot and mirroring capabilities. MapR’s lead software architect demonstrated the mirroring and snapshot functionality for us during the event today, pointing to the relative speed and ease with which one could immediately pull up snapshots and start rolling ahead with the application again. On top of that, the company claims that even with concurrent, repeated hardware or software outages, applications will keep running without admin alarm bells demanding immediate attention.

The tweaks are not just about reliability and failover; performance optimization is the key to this release. HBase tends to go through several processes that generate a lot of I/O overhead, so the team tried to eliminate these mini-bottlenecks, and it reports some significant performance increases over the last (M5) release. MapR says that, among other approaches, they’ve managed to eliminate the need for compactions, which means M7 can deliver uniform and consistent performance. The company also says that data structures designed to minimize read and write amplification make inserts and updates much faster, and that since M7 supports in-memory columns, users have more options to increase database performance.
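To make the term "write amplification" concrete, here is a minimal sketch of a toy model in Python. It is not MapR's design and not HBase's actual compaction policy. The function name, the parameters and the merge-everything rule are all assumptions made for illustration. The model only shows why rewriting files during compaction multiplies the bytes that reach disk relative to the bytes an application asked to write.

```python
def write_amplification(num_flushes: int, flush_mb: int, compact_every: int) -> float:
    """Toy model (hypothetical, for illustration only).

    Every flush writes one new file of `flush_mb` megabytes. Whenever
    `compact_every` files are on disk, all of them are merged and
    rewritten as a single file.
    Returns physical MB written divided by MB written by the application.
    """
    files = []         # sizes (MB) of the files currently on disk
    disk_written = 0   # total MB physically written so far
    for _ in range(num_flushes):
        files.append(flush_mb)
        disk_written += flush_mb        # the flush itself
        if len(files) >= compact_every:
            merged = sum(files)
            disk_written += merged      # compaction rewrites every byte
            files = [merged]
    user_written = num_flushes * flush_mb
    return disk_written / user_written


if __name__ == "__main__":
    # Same workload, with and without the toy compaction rule.
    print(write_amplification(9, 64, 3))
    print(write_amplification(9, 64, 10**9))  # threshold never reached
```

In this toy model a workload that never compacts has an amplification factor of exactly 1, while the same workload with periodic full merges writes each byte several times. Real systems use more selective policies, so the numbers here should not be read as measurements of HBase or M7.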

“If you look at HBase in the context of the loads of other NoSQL databases that are out there, we think we have an advantage in terms of offering better scalability, especially when you look at MongoDB or Cassandra, for example,” Norris told us today. He pointed to the scalability tweaks in M7 that he claims allow users to handle more than a trillion tables. This scalability is enhanced by the addition of more column families and expanded row and cell sizes.

Of course, none of this is useful without the ability to manage effectively. The company says that M7 greatly simplifies HBase administration by ensuring there are no separate processes to monitor and manage, no manual compactions, no manual region merges, no pre-splitting, no manual database repair operations and no downtime for standard maintenance.

Some have stated that the problem with MapR’s approach is that it creates a big data “lock-in” situation via the proprietary replacement of HDFS within its distro. The company was careful to skirt this criticism, noting that the NFS access capabilities actually provide a more open environment than users can get with the other distros. Further, the company says that when large-scale customers who are looking for a highly reliable platform evaluate their technology, they are concerned with finding the best solution for the job. In other words, what they seem to be suggesting here is that the lock-in creates a tricky situation, but if the performance, reliability and access are solid enough for what that user is trying to accomplish, it’s a price they’ll pay.

Despite the clear competition in the ever-growing ecosystem around Hadoop, Norris said that the company’s future is looking bright, especially as organizations look to put Hadoop into production across an ever-increasing range of mission-critical uses. Additionally, he pulled some numbers from a job-listing giant showing the steep climb in the number of jobs requiring some familiarity with Hadoop as a sign of growth, noting that this also means that organizations looking at Hadoop want developers who can do more with less time and effort, and this is their sweet spot in Norris’ view.

The main drivers for Hadoop adoption overall are strong, said Norris, pointing to the reduced cost of storage as one of the most basic. “If you look at the costs of storage alone, companies are being given all the incentive they need to keep all their data.” Beyond that, the fact that users are no longer painted into the corners of their own purpose-built models and questions is another driver. Rather than having to formulate specific questions in advance, Hadoop users do not need to know beforehand what they want to ask of their data.

While MapR is seeing a clear uptick in interest around Hadoop, they are confident that users who are evaluating the range of solutions around the platform are going to opt in favor of reliability and scalability at this point. These are two features the company targeted with this update—but the performance piece is where it will get really interesting across the entire ecosystem with next year’s presumed round of distro upgrades.

