June 4, 2013

Hadoop Distros Orbit Around Solr

Isaac Lopez

With recent announcements, the open source search tool Solr has become a notable focus for at least two Hadoop distro vendors.

MapR and Cloudera have both recently made announcements addressing the need for easier, search-based access to data across the organization. Last month, as part of their M7 release announcement, MapR gave details on their partnership with LucidWorks, introducing a beta-stage integration of LucidWorks Search with the MapR distribution of Apache Hadoop. Meanwhile, Cloudera today announced that they, too, are launching a beta search project.

Both search projects involve the Apache Solr search platform, which grew out of the Apache Lucene project started by Hadoop founder Doug Cutting. Solr has been an open source project since 2006 and has gained a lot of traction in that time, becoming one of the most widely installed search engines.

As both vendors snap Lucene/Solr into their Hadoop distros, they are taking different routes to the same end. Where MapR has partnered with LucidWorks, a company that was founded to provide commercial support for Solr, Cloudera says that they have hired some of the key committers to the Solr project to integrate it into their framework.

While the two vendors argue over which implementation is the better approach, the real winners in this development are the end users of Hadoop, who end up with an expanded array of use cases. By adding search to Hadoop, the distros give a wider range of users the ability to interact with Hadoop data. Charles Zedlewski, VP of Products at Cloudera, illustrated this point to us, saying, “If you think about all the MapReduce developers there are in the world today, there is probably a few hundred thousand. If you think of all the people that are SQL business analysts, there’s probably a million. But if you think about all the people who know how to use search, Google has more than a billion users by itself.”

To this end, MapR says that some of the users in their beta are creating applications that put end users in direct contact with Hadoop data. Tomer Shiran, VP of Product Management with MapR, described a use case in which a gaming company uses the search implementation to produce multiple indexes over all or some of the columns in an HBase table. Those indexes make more advanced queries possible, with lookups based not just on the primary key (which is standard for HBase and standard Hadoop) but also on other columns. “They end up using that for an application that is exposed to an end user, so the person who actually benefits from having this technology is the actual gamer,” said Shiran.
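
The idea Shiran describes, looking rows up by columns other than the row key, can be illustrated with a minimal Python sketch. This is not MapR's or LucidWorks' implementation: a plain dictionary stands in for the HBase table, and the table contents and the names `players`, `build_index`, and `lookup` are hypothetical, chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical stand-in for an HBase table: row key -> column values.
players = {
    "player#001": {"name": "ana", "guild": "red", "level": 12},
    "player#002": {"name": "ben", "guild": "blue", "level": 12},
    "player#003": {"name": "cy", "guild": "red", "level": 30},
}


def build_index(table, columns):
    """Map (column, value) -> set of row keys, for the chosen columns only."""
    index = defaultdict(set)
    for row_key, row in table.items():
        for col in columns:
            if col in row:
                index[(col, row[col])].add(row_key)
    return index


def lookup(index, **criteria):
    """Return the sorted row keys that match every column=value criterion."""
    matches = [index.get((col, val), set()) for col, val in criteria.items()]
    return sorted(set.intersection(*matches)) if matches else []


# Index only some of the columns, as in the use case described above.
index = build_index(players, ["guild", "level"])

print(lookup(index, guild="red"))            # rows found without knowing a row key
print(lookup(index, guild="red", level=12))  # criteria on two indexed columns
```

Without such an index, answering the same question would mean scanning every row, since the table itself is organized only by row key. A search engine such as Solr maintains far richer index structures than this, but the principle of trading extra storage for lookups on non-key columns is the same.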

The search integration is a boost for users who want to provide more business-level access to unstructured data in the Hadoop cluster, such as images, documents, or sound and video files, said Zedlewski, citing a use case from Cloudera customer and agricultural giant Monsanto.

Monsanto has been using Cloudera as part of their research process to develop new products: seeds with special properties such as drought resistance, disease resistance, or pesticide tolerance. The company photographs the seedlings at each stage of development as they move through their life cycle, in order to compare different strains or product iterations.

“Before, they basically just tried to jam all of these images into a database as a series of blobs, which they had to basically manually extract when they wanted to pull up an image,” explained Zedlewski. With the search implementation, he noted, Monsanto is able to index all of the images and extract different attributes from them, which lets researchers explore the collection using free-text search and makes it very easy to navigate and search a large volume of images.

Effective search was the killer app for the data on the Internet, enabling a wave of technology advancement that persists to this day. Whether Hadoop has that kind of gravity remains to be seen, but one thing is certain: the vendors peddling the stuff believe that the industry is ready to be hooked.

More to come…

Datanami