August 13, 2014

MapR Says Its Hadoop Tweaks Scale to Meet IoT Volumes

Hadoop is naturally positioned as a key architectural platform that organizations will turn to for analyzing Internet of Things (IoT) data. But according to the folks at MapR Technologies, limitations in plain vanilla Apache Hadoop make it unsuitable for big IoT workloads.

We’re at the beginning of a huge boom in the IoT. Gartner estimates the Internet will have 25 billion connected devices by 2020, while Cisco pegs the number at 50 billion. Any way you cut it, the amount of data generated by the billions of network-connected devices–not to mention about 7 billion connected humans–will continue its geometric growth curve.

The storage crunch will thrust Hadoop and other massively parallel computing platforms into the IoT limelight. However, the way the Hadoop Distributed File System (HDFS) stores data is not necessarily a good match for the IoT. The complication has to do with several architectural design factors, including the way HDFS relies on a single NameNode server to store the metadata that describes all of the individual files stored across a cluster.

The limitation means that HDFS in plain vanilla Apache Hadoop implementations tops out at around the 100-million file mark, says Jack Norris, chief marketing officer for MapR Technologies. “Depending on how you constructed it and deployed that, once you hit 100 million, you have to start doing unnatural acts to increase the limit,” he tells Datanami.
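A rough calculation suggests why a single NameNode runs out of room around this point. The sketch below assumes the commonly quoted rule of thumb that each file, directory, and block costs about 150 bytes of NameNode heap. That figure is not from Norris or MapR, and the function names are illustrative only.

```python
# Rule-of-thumb assumption (not from the article): every file, directory, and
# block is one in-memory object on the NameNode, costing roughly 150 bytes.
BYTES_PER_OBJECT = 150


def namenode_heap_bytes(num_files, blocks_per_file=1,
                        bytes_per_object=BYTES_PER_OBJECT):
    """Estimate the NameNode heap needed to track num_files files."""
    # One object for the file itself plus one object per block it occupies.
    objects = num_files * (1 + blocks_per_file)
    return objects * bytes_per_object


def to_gb(num_bytes):
    """Convert bytes to decimal gigabytes."""
    return num_bytes / 1e9


if __name__ == "__main__":
    for files in (100_000_000, 1_000_000_000, 1_000_000_000_000):
        heap_gb = to_gb(namenode_heap_bytes(files))
        print(f"{files:>17,} files -> ~{heap_gb:,.0f} GB of NameNode heap")
```

Under these assumptions, 100 million single-block files need roughly 30 GB of heap on one server, and the requirement grows linearly from there, which is why small files are so costly even though they hold little data.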

While it may sound like a lot, 100 million files barely gets you in the door of the machine-generated big data world, where billions or even trillions of individual events may need storing. Those volumes are not a good match for HDFS and its original design as a write-once, read-many file system for batch-oriented workloads.

While Hadoop is great at ingesting massive amounts of data stored across relatively big files, it’s not so great at performing random reads over relatively small files. Sensor data generated from servers, cellphones, cars, trucks, refrigerators, thermostats, jet engines, wind turbines, furnaces, cameras, cats, and everything else on the IoT will be very high in volume, but each file will be relatively small in size.

One common way of getting around the 100-million-file barrier (which is reduced to 50 million in some high availability Hadoop setups) is implementing multiple NameNodes in a federated construct, and then sharding the metadata across those servers. That works but adds a high degree of complexity.
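One way to picture the federated approach is a client-side mount table that maps each part of the directory tree to the NameNode that owns it. The sketch below is a simplified illustration, not Hadoop's actual client code. The host names, path prefixes, and the resolve_namenode function are all hypothetical.

```python
# Hypothetical mount table: each path prefix is owned by one NameNode.
MOUNT_TABLE = {
    "/sensors": "namenode-1",
    "/logs": "namenode-2",
    "/": "namenode-0",  # fallback for everything else
}


def resolve_namenode(path, mount_table=MOUNT_TABLE):
    """Return the NameNode responsible for path, using longest-prefix match."""
    best = None
    for prefix in mount_table:
        # Match whole path components only, so "/sensorsX" is not "/sensors".
        is_match = (prefix == "/" or path == prefix
                    or path.startswith(prefix + "/"))
        if is_match and (best is None or len(prefix) > len(best)):
            best = prefix
    if best is None:
        raise KeyError(f"no NameNode is mounted for {path}")
    return mount_table[best]


if __name__ == "__main__":
    print(resolve_namenode("/sensors/truck42/2014-08-13.dat"))
    print(resolve_namenode("/tmp/scratch"))
```

Each NameNode then holds only the metadata under its own prefixes. The added complexity shows up in keeping the table consistent across clients and in rebalancing when one part of the tree outgrows its server.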

Then there’s the approach taken by Facebook and other big Web properties, which “spend a lot of time taking small files and compacting them, concatenating them into larger files. Then when they need them, they break out those large concatenated files into small files,” Norris says. “So there’s a lot of [attempts] at working around [these] severe limitations in how Apache Hadoop was architected.”
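The compact-and-extract workaround Norris describes can be sketched in a few lines. This is a minimal in-memory illustration and not how Facebook or any Hadoop tool actually implements it. The pack and unpack function names and the sample file names are assumptions made for the example.

```python
import io


def pack(small_files):
    """Concatenate {name: bytes} into one blob plus an index of (offset, length)."""
    blob = io.BytesIO()
    index = {}
    for name, data in small_files.items():
        index[name] = (blob.tell(), len(data))  # where this file starts, and its size
        blob.write(data)
    return blob.getvalue(), index


def unpack_one(blob, index, name):
    """Recover a single small file from the concatenated blob."""
    offset, length = index[name]
    return blob[offset:offset + length]


def unpack_all(blob, index):
    """Break the large blob back out into its original small files."""
    return {name: unpack_one(blob, index, name) for name in index}


if __name__ == "__main__":
    readings = {"a.json": b'{"t":1}', "b.json": b'{"t":22}'}
    blob, index = pack(readings)
    assert unpack_all(blob, index) == readings
    print(f"{len(readings)} small files stored as one {len(blob)}-byte file")
```

The file system now tracks one large file instead of many small ones, at the cost of maintaining the index and running the extra packing and unpacking steps.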

Hadoop’s limitations have also spurred research and adoption of faster and more nimble in-memory streaming analytic tools that sit atop (or next to) Hadoop, such as Apache Storm, Apache Spark, and Apache Kafka, among others. We’ve also seen some promising proprietary products, like DataTorrent, an in-memory tool that can process billions of events per second (obviously not writing them to HDFS as individual files).

The demand for real-time operational analytics on IoT data has also given NoSQL and next-gen SQL (or NewSQL) database vendors something to shoot for. In many instances, organizations have placed a high RAM-count NoSQL or scale-out SQL database in front of a Hadoop cluster to perform certain light analytic functions as the data flows by. These customers often still use Hadoop for the big data mining projects and to train their machine learning algorithms. But they often turn to nimbler databases, like MongoDB or MemSQL, to automate the real-time decision-making.

MapR Technologies did several things to get around HDFS’ limitations and enable its Hadoop distribution to store 1 trillion files. The biggest changes include making core architectural improvements to the file system and integrating the HBase NoSQL database deeply into its distribution.

“The way we got to 1 trillion is we don’t have that fixed NameNode. We distribute all the metadata across the cluster. All nodes participate, so it’s a very linear scaling process as you expand,” Norris says. “We took a look at the file system and made sure it was robust and could handle complete random read/write… Then we followed up with the MapR database, the M7 edition, which has that [HBase] Big Table type of functionality, where you have database processing integrated in with big data.”

MapR’s co-founder and CTO, M.C. Srivas, was one of the original Big Table developers at Google, and he brought his extensive knowledge of distributed processing and storage to bear on MapR’s Hadoop distribution. That distribution is widely considered the most heavily modified version of Hadoop on the market, and the farthest from the main Apache trunk.

To be sure, other Hadoop distributors are aware of the issues and are working with the same deck of cards. MapR has accepted the tradeoff of not adhering completely to Apache Hadoop specs, and modifying the open source code where it sees a potential advantage. If customers see advantages in having a Hadoop distribution that speaks NFS (not just HDFS) and scales to 1 trillion files, then they need to be prepared to step off the well-travelled open source track and onto MapR’s proprietary paths.

Norris is unapologetic about not adhering 100 percent to Apache Hadoop. “It’s not as if we just anticipated this and did some stuff six months ahead of the open source community,” he says. Surpassing the 100-million file limit “required heavy investment–two years of stealth mode investment in the architecture.”

The company thinks the mods will be well worth it as the IoT and operational analytic requirements stress the scalability of plain vanilla Hadoop deployments. And MapR says it’s not the only one tooting its horn: it points to the $80 million investment that Google made recently in the company.

“It’s perhaps no surprise then that Google Capital led our latest round of investment, and they did that after giving some pretty deep due diligence on the technology and by talking to our customers on how they’re using the technology,” Norris says.
