February 24, 2014

Top 10 Netflix Tips on Going Cloud-Native with Hadoop

Alex Woodie

Four years ago Netflix made the decision to move all of its data processing–everything from NoSQL and Hadoop to HR and billing–into the cloud. While going “cloud native” on Amazon Web Services hasn’t been without its challenges, the move has benefited Netflix in multiple and substantial ways. Here are 10 tips from Netflix on making the cloud work.

When Netflix’s senior management decided that all of its data processing would run in the cloud, not everybody was happy.  “I thought it was crazy,” says Kurt Brown, director, data platform at Netflix. “[I thought], We’re so good at Oracle, we’re so good at the data center thing. We’ve got world-class people doing it.”

Over time, Brown realized that the benefits of moving the big data infrastructure far outweighed the costs. “It just makes sense to go there. We’re going to be a big company and go global,” he said during a presentation at last week’s Hadoop Innovation Summit in San Diego, California. “Over time, I’ve totally drunk the Kool-Aid and I’m going to give you some Kool-Aid today. If you don’t want any Kool-Aid, then cover your ears.”

Here are 10 tips gleaned from Brown’s presentation on going cloud native with Hadoop:

  1. Show Me the Elasticity, Not the Money.  “Elasticity is a huge thing,” Brown says. It gets you away from the “Oh shoot bought too much hardware and spent too much money” situation on the one hand, and the “Oh shoot I didn’t I buy enough hardware and now I’m going to have a really bad customer experience” on the other. While Netflix reserves Amazon capacity for the vast majority of its S3 and EMR workloads, it’s able to spin up AWS capacity as needed to deal with both expected and unexpected spikes. It’s also able to sell its unused reservations, although it has actually never done that, Brown says. And while Amazon has saved Netflix “a ton of money,” that was not the primary driver to move to the cloud. “This is absolutely the wrong question to be asking, “Brown says. “It’s not that it’s not important. But it’s not the most important question for why you should be in the cloud.”
  2. Avoid Heavy Lifting: It takes some getting used to, but giving up control and letting Amazon do the heavy lifting can be liberating. “It’s all about focus. At the end of the day, a lot of this is just plumbing. And do you really want to build this plumbing? If somebody’s going to do something for you for free, take it,” Brown says. Netflix takes full advantage of Amazon services like Simplified Queue Service (SQS) for messaging, EMR for Hadoop, and S3 for storage, which has freed developers to focus on writing the best applications to build its business. When people fret about the lack of control in the cloud, Brown flips it around. “The opposite of lack of control is you control it. So now you’re on the hook for everything! Do you really want that at the end of the day? Or do you want just to deal with APIs and the services themselves, and not the hardware? It really comes down to undifferentiated heavy lifting.”
    Kurt Brown, director, data platform for Netflix
  3. Get Creative with Excess Capacity: “You want to leverage that excess capacity in between the peaks and troughs,” Brown says. Netflix does this by using the downtime to pre-compute algorithms, for batch processing, and for encoding videos at various bit rates. (While Amazon runs nearly all of Netflix’s big data applications, the company uses a separate content delivery network [CDN] hosted by the likes of Akamai and Limelight for delivering its videos, which accounts for one-third of all Internet traffic). “Anything you can think of, use that valley between the peaks and troughs. It’s hard to do efficiently, but it’s a way to maximize your investment,” Brown said.
  4. Be Flexible with S3 Performance Issues: If there is one downside to running on Amazon, it’s that performance can be inconsistent. It’s just a fact of life when it running on top of virtual machines in the cloud. “You just have to work around it,” Brown said. For example, Amazon has built timeouts into its EMR service, and when they started interfering with Netflix’s Hadoop jobs, the company devised “suicidal timeouts,” which are jobs that try to restart themselves for a certain period of time, and then kills itself when it fails.  “Yes [the performance concerns] exist, but you can mitigate it. And it’s not so bad–it’s getting better all the time.”
  5. Less is More with S3 Instances: Netflix has experimented with different AWS instance types, and found that having many of the smaller instance types is often better than having fewer but bigger instances. “At the end of the day, because we’re using S3 as a back-end data store, it was actually better for us to use these cheaper lightweight instances because we got more aggregate bandwidth to S3,” Brown said. The type of instance also matters to Netflix’s Hadoop installation, running on EMR. “It’s hard to create a Hadoop cluster out of this Frankenstein of different instances,” Brown said. “So in some cases, it actually makes sense to have fewer instance types, even if it’s a little less efficient for the specific application, because it’s easier to share.”
  6. Create Many Small Web Services: You’re not going to be happy running a massive Java application on AWS, because you run the risk of having your entire operation shut down over a minor technical glitch. Instead, break your service up into many individual programs. “You want small isolated services with clearly defined interfaces,” Brown said. “So now you have a huge proliferation of interfaces and small things. Aren’t they going to break? Yeah, they’re going to break. Some things break. You just have to be ready for that.”
  7. Break Your Own Stuff:  “You should poke at yourself and break it occasionally,” Brown said. “Otherwise, you start building a lot of technical depth, and when it goes bad it goes really bad.” Netflix uses a product called Chaos Monkey that randomly shoots down S3 instances.  It’s also looking to use the monkey’s older brothers, Chaos Gorilla, which simulates taking down an entire Amazon Availability Zone (AZ), and Chaos Kong, which takes down a whole region.
  8. Manage Your Cluster for Availability: Amazon has built a certain degree of availability into its system, but the fact remains that stuff will go wrong. For example, when a hosted Hadoop cluster starts exhibiting signs of a memory leak (not an infrequent occurrence, Brown says), the company spins up a whole new cluster, and starts migrating the workload over. “If it’s looking good, we’ll move more traffic over,” he said. “After an hour, kill the old cluster, and you’re on the new cluster.  This way you don’t have to deal with rolling upgrades and mismatches of versions.”
  9. Use a Different Cloud for Backup: One of the concerns about moving workloads to the cloud is the ephemeral nature of the business. What if Amazon just disappears one day? While such a situation is practically impossible, it is still within the realm of possibility. While Netflix is 100 percent dedicated to running all of its operations in Amazon, it does store its backups on Google’s cloud.
  10. Don’t Use HDFS: While Netflix is a big Hadoop user, it doesn’t use the Hadoop Distributed File System (HDFS) all that much. Instead, it uses S3 as the source and the destination for all Hadoop (MapReduce, Pig, or Hive) jobs. People might think Netflix is crazy for doing that, Brown says. And while Netflix does lose a little bit of performance from this configuration (since locality of data and processing power is one of the hallmarks of Hadoop and HDFS) the upside is that Netflix gets much greater data availability. “You get eleven 9′s of durability out of S3, which is a heck of a lot better than you’re going to get anything in the data center,” Brown says. “That’s good if you want to trust your data.” If a Hadoop job goes down, the data is safe in S3, and Netflix just spins up another Hadoop cluster to re-run the job. It also allows multiple Hadoop jobs to share the same data, and also includes a handy “soft delete” function that allows users to recover data.

Related Items:

Hitching a Big Data Ride on Amazon’s Cloud

What Can GPFS on Hadoop Do For You?

Rethinking Real-Time Hadoop