June 16, 2017

Yahoo’s Massive Hadoop Scale on Display at Dataworks Summit

Alex Woodie

Yahoo put its massive Hadoop investment on display this week at Dataworks Summit, the semi-annual big data conference that it co-hosts with Hortonworks.

While Hadoop is no longer the conference headliner that it once was, the platform is still critical for the daily operations of Yahoo, which officially became part of Verizon Communications this week when the $4.5 billion acquisition finally closed. With 120,000 servers and 800 PB of storage, few companies have the computing scale of Yahoo. And as the birthplace of this distributed computing platform called Apache Hadoop, it’s worth keeping an eye on what Yahoo is doing with its collection of big data tech.

Sumeet Singh, the senior director of cloud and big data platforms at Yahoo, took to the Dataworks Summit stage on Wednesday to describe how the technological makeup of Yahoo’s massive cloud platform has evolved over the years.

For starters, the company is moving solidly away from MapReduce. Over the past 17 months, Tez has replaced MapReduce as the underlying engine for many of the batch-oriented Pig and Hive workloads that Yahoo relies on to serve its 1 billion monthly users. Today, 70% of the Hadoop workloads at Yahoo run under Tez, according to Singh. Use of Apache Spark has also grown, but not nearly as quickly as Tez, he says.

Singh referred to this switch from MapReduce to Tez and Spark as “compute shaping.” “What compute shaping does, it allows us to make better use of the platform,” he says. “This is fantastic for the company and our customers because they can make better use of the capacity.”

Tez has steadily replaced MapReduce for batch Hadoop workloads at Yahoo

Singh also provided a glimpse into Yahoo’s use of Apache Storm, which the company has relied on to provide real-time processing of data for the past five to six years. As Singh explains, the company has been keen to modernize its Storm clusters to squeeze more efficiency from them.

“We’ve been constantly in the quest to move old topologies, old tenants, from the old scheduler to the new resource scheduler. That has really come to fruition in the last seven months,” he says. “As these topologies are migrated from the old scheduler to the new scheduler, you can see the compute efficiency, for both CPU cost and memory, improves…from the high 40s to the 50s and 60s.”

Hortonworks is also betting on Storm, as it subtly shifts its emphasis away from processing data at rest in big data lakes to processing data as soon as it arrives over the wire with its Hortonworks Data Flow (HDF) product. Hadoop is not the center of gravity that it once was in the big data space, but it’s still critical for the operations of Yahoo and many other users.

“Hadoop has been the quintessential cloud platform for Yahoo. It’s this constant shaping of the platform that’s allowed us to stay strong and mature the platform over these years,” Singh says. “Obviously we see a lot of diversity in our use case … the platform has fostered new paradigms. It’s this evolution which I think is key to a successful platform.”

Yahoo’s Apache Storm clusters have become more efficient with CPU and memory over the past year and a half

In a separate session, two Yahoo employees provided details on the massive compute and storage capacity required to power Yahoo Mail.

Yahoo Mail is unique among public email providers in that it provides 1TB in storage space for each user. With millions of users sending 26 billion emails per day, all that space adds up in a hurry, and the end result is a 50PB warehouse dedicated to storing Yahoo Mail customers’ email.

With so many emails flying around, keeping the inbox neat and tidy is important. That’s why Yahoo uses machine learning technology to “help the user keep organized,” says Yahoo engineer Nick Huang. “Mail has become people’s personal database.”

The machine learning technology is also critical to weeding out the junk mail. Approximately 90% of the emails received by Yahoo Mail clients are generated by machines, and 80% of them are spam.

Yahoo runs its own servers; it’s not an Amazon Web Services, Google Compute Cluster, or Microsoft Azure cloud client, thank-you-very-much. That means that each day it must move mail among its various global data centers by itself. To keep Yahoo Mail running between data centers, the company relies on its own version of the Pony Express: the Data Highway Rainbow.

According to Yahoo engineer Saurabh Dixit, the Data Highway Rainbow disseminates upwards of 800TB of data represented in 250 billion user events across the various data centers that Yahoo has placed around the world.
