Elastic MapReduce Lead Traces Big Data Clouds
Peter Sirota, General Manager of Amazon’s Elastic MapReduce service, spoke at length at the AWS Gov Summit 2011 about the benefits of Hadoop and MapReduce and the company’s own Elastic MapReduce offering. He claims these technologies are dramatically lowering the cost of development and experimentation for emerging “big data” applications and are broadening the range of new mission-critical applications that require scalable infrastructure.
During his talk, Sirota pointed to a number of Elastic MapReduce use cases in the context of household-name web-based businesses, including Etsy, Foursquare, and Clickstream. One of the more interesting use cases he mentioned involved Yelp, a still-expanding review site with more than 50 million monthly visitors and over 18 million reviews stockpiled.
Standard visitor activity across the site generates roughly 400 GB of data per day, accumulating to about 12 terabytes per month. This data does not sit idle on backup storage somewhere, however. The company uses it to power nearly all of the key features that make Yelp stand out from the rest, including its ability to make creepily on-target recommendations about similar activities in a given area, for instance.
Sirota spent considerable time discussing how Yelp uses the service to handle everything from its advanced recommendation-engine features to…spell check. Yes. Spell check.
While spell check technology is considered a last-century innovation, there is nothing old school about the way Yelp handles the problem of mistyped or misspelled words. Using Elastic MapReduce, the company is innovative both in the way it leverages AWS’ hosted Hadoop and in the way it seamlessly integrates the results into its customer-facing application.
To arrive at eerily correct options for any number of misspelled words or locations, Yelp uses AWS Elastic MapReduce to go far beyond the red underlining you see in your word processor. The company takes six months of search data based on what users type into the site’s search bar, loads this massive dataset into S3, and promptly spins up a 200-node cluster in EC2. Because Elastic MapReduce is fully integrated with S3, the cluster pulls the data in directly, and within a few hours a rather simple algorithm has produced a “misspelling map” that is loaded into the application. Now, when users start typing “La Hoya” hotel, the correct option (based on the mistakes of others), “La Jolla hotel,” appears right under the search bar after just the first few letters.
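Sirota did not detail the algorithm, and Yelp’s actual job ran as distributed Hadoop MapReduce on EMR, but the core idea can be sketched on a single machine. Assuming (as an illustration, not Yelp’s real log format) that the search logs yield pairs of what a user typed and what they ultimately selected, the “map” step groups corrections under each misspelling and the “reduce” step keeps the most frequent one:

```python
from collections import Counter, defaultdict

def build_misspelling_map(log_entries):
    """Toy sketch of a misspelling map built from (typed, selected) pairs.

    The real job would run these two phases as distributed Hadoop
    map and reduce tasks over months of logs stored in S3.
    """
    # "Map" phase: group observed corrections under each misspelling
    corrections = defaultdict(Counter)
    for typed, selected in log_entries:
        if typed.lower() != selected.lower():
            corrections[typed.lower()][selected] += 1
    # "Reduce" phase: keep the most common correction per misspelling
    return {typed: counts.most_common(1)[0][0]
            for typed, counts in corrections.items()}

logs = [("la hoya", "La Jolla"), ("la hoya", "La Jolla"),
        ("la hoya", "La Hoya"), ("portlund", "Portland")]
print(build_misspelling_map(logs))
# {'la hoya': 'La Jolla', 'portlund': 'Portland'}
```

The resulting dictionary is small enough to load into the front-end application, which is why the heavy lifting can be batch-processed offline on a transient cluster.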
AWS says that Yelp runs a range of different applications, each with its own custom configuration. While it is possible to use standard instance types, AWS claims that Yelp uses many different cluster sizes with different types of nodes; some workloads are CPU-intensive, some I/O-bound, and others memory-hungry. The company spins up roughly 250 different clusters each week, each fine-tuned for its purpose.
Sirota says this is one advantage for developers at companies like Yelp. Instead of investing in the hardware required to test new applications or services, and then fine-tuning the finalized application to that hardware, it is possible to spin up a test/dev cluster in moments and invest nothing more than a few dollars and some time. Furthermore, since memory, I/O, compute power, and other elements can be adjusted, there is no need to tailor applications to fit the machines they run on.
The company’s review highlight feature is one example of how big data is being turned around to produce immediate value. The feature mines Yelp’s extensive collection of restaurant reviews for key strengths and weaknesses, then applies this analysis to each restaurant’s own review data so site visitors can see the overall sentiment without scanning through every review.
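Yelp’s actual highlight pipeline is not public, but the basic idea — surfacing phrases that recur across a restaurant’s reviews — can be sketched with simple phrase counting (the function name and bigram heuristic here are illustrative assumptions, not Yelp’s method):

```python
from collections import Counter
import re

def review_highlights(reviews, top_n=2):
    """Toy sketch: surface the most frequently repeated phrases
    (word bigrams) across a set of reviews as candidate highlights."""
    phrases = Counter()
    for review in reviews:
        words = re.findall(r"[a-z']+", review.lower())
        # count adjacent word pairs as candidate "highlight" phrases
        phrases.update(zip(words, words[1:]))
    return [" ".join(pair) for pair, _ in phrases.most_common(top_n)]

reviews = ["The fish tacos were amazing",
           "Great fish tacos, slow service",
           "Loved the fish tacos"]
print(review_highlights(reviews))
```

A production system would layer sentiment analysis and much larger-scale counting on top, which is exactly the kind of batch workload that suits a transient MapReduce cluster.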
The video below is Sirota’s full presentation on Yelp and other use cases. Toward the end he describes Yelp’s data warehousing architecture and provides some insights about new uses for Elastic MapReduce. It runs over 30 minutes, so if you’re short on time, his key points are summarized below.
Sirota says that big data requires new approaches at both the hardware and software levels. He says this is due to unique challenges that did not exist before there were endless data streams from social media, a vast array of sensors and mobile devices, and more complex, heavy computer-generated data sources.
This data can be mined across a number of channels, using technologies like sentiment analysis, recommendation engines, and standard text analysis. However, the sheer volume makes these datasets a challenge, and the complexity increases when what used to fit in one box must now be scaled across a distributed system.
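The shift from one box to a distributed system is less daunting than it sounds for aggregation workloads, because many of them decompose naturally: each machine computes a partial result over its own shard of the data, and the partials are merged at the end. A minimal single-process sketch of that pattern (the shard layout and names are illustrative):

```python
from collections import Counter

def count_partition(records):
    """'Map' step: each worker counts terms in its own shard of the data."""
    return Counter(records)

def merge_counts(partials):
    """'Reduce' step: combine the partial counts from all shards."""
    total = Counter()
    for partial in partials:
        total += partial
    return total

# Simulate three shards that would live on three separate machines
shards = [["pizza", "sushi"], ["pizza"], ["sushi", "pizza"]]
partials = [count_partition(shard) for shard in shards]
print(merge_counts(partials))  # Counter({'pizza': 3, 'sushi': 2})
```

This map-then-merge decomposition is the essence of what Hadoop automates across a real cluster, including the shard placement and fault handling that a single box never had to worry about.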
Furthermore, Sirota says that another challenge of big data is structure, especially as more data scientists try to consolidate data from multiple sources, formats, and businesses into a single application. For instance, it would be quite difficult to merge Salesforce, Facebook, and LinkedIn data with the rapid-fire Twitter firehose to find answers to questions you hadn’t yet fully formulated.
This is where MapReduce and Hadoop enter the picture. Sirota says that with Amazon Web Services’ hosted Hadoop offering, Elastic MapReduce, data scientists can leverage the advantages of Hadoop and MapReduce on “infinitely” scalable infrastructure: changing cluster sizes, spinning up new clusters to meet additional demand, and even running development clusters against the same datasets that are serving mission-critical workloads in the background via the Amazon S3 (and HDFS-capable) backbone.
As Sirota said in his talk, “If you combine EC2 and Hadoop you get Amazon Elastic MapReduce (Hadoop hosted on EC2)—you get a fundamental shift in the economics of data processing. Elastic MapReduce lets you focus on developing your application without having to worry about operating infrastructure.” In addition to being the largest cloud provider at present, the company has, arguably, made some of the most significant strides in broadening access to big data tools and technologies, freeing developers to create without the hassle of turning to (or becoming) system admins.