June 28, 2013

Yahoo! Spinning Continuous Computing with YARN

Isaac Lopez

YARN was the big news this week, with the announcement that the Hadoop resource manager is finally hitting the streets as part of the Hortonworks Data Platform (HDP) “Community Preview.”

According to Bruno Fernandez-Ruiz, who spoke at Hadoop Summit this week, Yahoo! has been able to leverage YARN to transform the processing in their Hadoop cluster from simple, stodgy MapReduce, to a nimble micro-batch engine processing machine – a change which they refer to as “continuous computing.”

The problem with MapReduce, explained Fernandez-Ruiz, VP of Platforms at Yahoo!, is that it once the processing ship has launched, new data is left at the dock. This creates a problem in the age of connected devices and real time ad serving, especially when the network your running is trying to make sense of 21 billion events per day, and you’ve got users and advertisers counting on you to be right.

Having MapReduce batch jobs that take anywhere between two and six hours, Yahoo! wasn’t getting the fidelity that they needed in their moment-to-moment information for such things as their Right Media services, their personalization algorithms, and their machine learning models. Problems such as this became an impetus for their getting involved with Hortonworks and the Apache Hadoop community to develop YARN, said Fernandez-Ruiz.

“We figured out that we had to change the model, and this is what we’re calling ‘continuous computing,’” he explained. “How do you take MapReduce and change this notion of [going from] big long windows of several hours to running in small continuous incremental iterations – whether that is 5 minutes, or 5 seconds, or a half a second. How do you move that batch job from being long running, to being micro-batch?”

One of the solutions they’ve turned to, says Fernando-Ruiz, is to leverage YARN to move certain processes from big batch jobs to streaming. To accomplish this they turned to the open source distributed real-time computation system, Storm.

Pivoting off of YARN, Yahoo! was able to use Storm to reduce a process window that was previously 45 minutes (and as long as an hour and a half) long, to sub-5 seconds, correcting a problem that they had with unintentional client over-spend on their Right Media exchange.

Fernando-Ruiz says that they currently have Storm running in this implementation on 320 nodes of their Hadoop cluster, processing 133,000 events per second, corresponding roughly to 500 processes that are running with 12,000 threads. (Fernando-Ruiz says that their port of Storm into YARN has been submitted into the Storm distribution, and is now available for anyone to use).

He explained that they are also using YARN to run UC Berkely AMPLab’s data analytics cluster computing framework, Spark, in conjunction with MapReduce to help with personalization on their network. According to Fernando-Ruiz, Yahoo! is using long-running MapReduce to calculate the probabilities of an individual users interests (he used the example of a user having a fashion emphasis). While the batch runs, Spark is continuing to score the user.

“If you actually send an email, perform a search query, or you have been clicking on a number of articles, we can infer really quickly if you happen to be a fashion emphasis today,” he said. They then take the data and feed it into a scoring function, which flags the individual according to their interests for the next few minutes, or even 12 hours, delivering personalized content to them on the Yahoo! network.

According to Fernando-Ruiz, this deployment with Spark is currently running as a much smaller deployment of 40 nodes, and that the continuous training of these personalization algorithms has experienced a 3 times speed up. (Again, Spark has been ported to YARN, and is currently available for download).

The final use case he discussed uses HBase, Spark, and Storm running on 19,000 nodes for machine learning to train their personalization models. With data constantly accumulating, Yahoo! looks towards using the flood to calibrate and train their models and classifiers.

“The problem is that every time you load all that data…it’s a long running MapReduce job – by the time you finish it, you’re basically changing maybe 1% of the data. Ninety-nine percent of the data is that same, so why are you running the MapReduce job again on the same amount of data? It would be better to actually do it iteratively and incrementally, so we started to use Spark to train those models at the same time we’re using Storm.”

He says that Hadoop 2.0 is changing the way that they view Hadoop altogether. “It’s no longer just these long-running MapReduce jobs – it’s actually the MapReduce jobs together with an ability to process in very low latency the streaming signals that we get in. To not have that window of processing , but actually have a very small window of processing together with the ability to not have to reload all the data set in memory…to go and iteratively and incrementally do those micro-batch jobs.”

All that, he says, is possible thanks to YARN, and the new development of splitting the resource manager.

The Art of Scheduling in Big Data Infrastructures Today

Gartner’s Adrian Raps on Big Data’s Present and Future

Applications: Complex Event Processing, Enterprise Analytics, Predictive Analytics, Research Analytics

Technologies: Frameworks, Processors, Systems

Sectors: Financial Services, Government, Healthcare, Manufacturing, Other, Retail, Science

Vendors: Hortonworks

Tags: Bruno Fernandez-Ruiz, continuous computing, Hadoop, Hadoop Summit, HBase, hdp, Hortonworks, mapreduce, Spark, storm, UC Berkeley AMPLabs, Yahoo!, yarn

Only registered users may comment. Register using the form below.

Check off newsletters you would like to receive*
- HPCwire
- EnterpriseTech
- Datanami
- Technology Conferences & Events
- Advanced Computing Job Bank
- Technology Product Showcase
Email*
Name*
First Last
Organization*
Job Function*
Industry*
Country*
City*
State*
Province*
- Please check here to receive valuable email offers from Datanami on behalf of our select partners.

Yahoo! Spinning Continuous Computing with YARN

Join the discussion Cancel reply

Only registered users may comment. Register using the form below.

April 19, 2024

April 18, 2024

Sponsored Partner Content

Get your Data AI Ready – Celebrate One Year of Deep Dish Data Virtual Series!

Supercharge Your Data Lake with Spark 3.3

Learn How to Build a Custom Chatbot Using a RAG Workflow in Minutes [Hands-on Demo]

Overcome ETL Bottlenecks with Metadata-driven Integration for the AI Era [Free Guide]

Gartner® Hype Cycle™ for Analytics and Business Intelligence 2023

The Art of Mastering Data Quality for AI and Analytics

Leading Solution Providers

Tabor Network

Sponsored Whitepapers

Building an Operational Data Warehouse for Real-time Analytics

Can You Use Kafka as a Database?

Sponsored Multimedia

The Power of DataOps: Bring Automation to Life
No Comments

Tactical Steps for Cloud Migration
No Comments

Immuta Data Access Platform
No Comments

Data Mesh: Fact or Fiction?
No Comments

Contributors

Featured Events

Call & Contact Center Expo

AI & Big Data Expo North America 2024

AI Hardware & Edge AI Summit 2024

CDAO Government 2024

Yahoo! Spinning Continuous Computing with YARN

Join the discussion Cancel reply

Only registered users may comment. Register using the form below.

April 19, 2024

April 18, 2024

Most Read Features

Most Read News In Brief

Most Read This Just In

Sponsored Partner Content

Leading Solution Providers

Tabor Network

Sponsored Whitepapers

Sponsored Multimedia

Contributors

Featured Events

Share

Copy short link