June 16, 2017

Yahoo’s Massive Hadoop Scale on Display at Dataworks Summit

Yahoo put its massive Hadoop investment on display this week at Dataworks Summit, the semi-annual big data conference that it co-hosts with Hortonworks.

While Hadoop is no longer the conference headliner that it once was, the platform is still critical for the daily operations of Yahoo, which officially became part of Verizon Communications this week when the $4.5 billion acquisition finally closed. With 120,000 servers and 800 PB of storage, few companies have the computing scale of Yahoo. And as the birthplace of the distributed computing platform called Apache Hadoop, Yahoo is worth watching for what it does with its collection of big data tech.

Sumeet Singh, the senior director of cloud and big data platforms at Yahoo, took to the Dataworks Summit stage on Wednesday to describe how the technological makeup of Yahoo’s massive cloud platform has evolved over the years.

For starters, the company is moving solidly away from MapReduce. Over the past 17 months, Tez has replaced MapReduce as the underlying engine for many of the batch-oriented Pig and Hive workloads that Yahoo relies on to serve its 1 billion monthly users. Today, 70% of the Hadoop workloads at Yahoo run under Tez, according to Singh. Use of Apache Spark has also grown, but not nearly as quickly as Tez, he says.

Singh referred to this switch from MapReduce to Tez and Spark as “compute shaping.” “What compute shaping does, it allows us to make better use of the platform,” he says. “This is fantastic for the company and our customers because they can make better use of the capacity.”

Tez has steadily replaced MapReduce for batch Hadoop workloads at Yahoo

Singh also provided a glimpse into Yahoo’s use of Apache Storm, which the company has relied on to provide real-time processing of data for the past five to six years. As Singh explains, the company has been keen to modernize its Storm clusters to squeeze more efficiency from them.

“We’ve been constantly in the quest to move old topologies, old tenants, from the old scheduler to the new resource scheduler. That has really come to fruition in the last seven months,” he says. “As these topologies are migrated from the old scheduler to the new scheduler, you can see the compute efficiency, for both CPU cost and memory, improves … from the high 40s to the 50s and 60s.”

Hortonworks is also betting on Storm, as it subtly shifts its emphasis away from processing data at rest in big data lakes to processing data as soon as it arrives over the wire with its Hortonworks Data Flow (HDF) product. Hadoop is not the center of gravity that it once was in the big data space, but it’s still critical for the operations of Yahoo and many other users.

“Hadoop has been the quintessential cloud platform for Yahoo. It’s this constant shaping of the platform that’s allowed us to stay strong and mature the platform over these years,” Singh says. “Obviously we see a lot of diversity in our use case … the platform has fostered new paradigms. It’s this evolution which I think is key to a successful platform.”

Yahoo’s Apache Storm clusters have become more efficient with CPU and memory over the past year and a half

In a separate session, two Yahoo employees provided details on the massive compute and storage capacity required to power Yahoo Mail.

Yahoo Mail is unique among public email providers in that it provides 1TB in storage space for each user. With millions of users sending 26 billion emails per day, all that space adds up in a hurry, and the end result is a 50PB warehouse dedicated to storing Yahoo Mail customers’ email.

With so many emails flying around, keeping the inbox neat and tidy is important. That’s why Yahoo uses machine learning technology to “help the user keep organized,” says Yahoo engineer Nick Huang. “Mail has become people’s personal database.”

The machine learning technology is also critical to weeding out the junk mail. Approximately 90% of the emails received by Yahoo Mail clients are generated by machines, and 80% of them are spam.

Yahoo runs its own servers; it’s not an Amazon Web Services, Google Compute Engine, or Microsoft Azure cloud client, thank-you-very-much. That means that each day it must move mail among its various global data centers by itself. To keep Yahoo Mail running between data centers, the company relies on its own version of the Pony Express: the Data Highway Rainbow.

According to Yahoo engineer Saurabh Dixit, the Data Highway Rainbow disseminates upwards of 800TB of data represented in 250 billion user events across the various data centers that Yahoo has placed around the world.
