Apache Spark: Now Offered on Amazon EMR
The number of places you can run Apache Spark increases by the week, and last week hosting giant Amazon Web Services announced that it’s now offering Apache Spark on its hosted Hadoop environment.
The addition of Spark gives Amazon Elastic MapReduce (EMR) customers another big data processing engine to run alongside those already available, including Hive, Pig, HBase, Presto, and Impala, among others.
While this is not the first time Apache Spark has graced a computer running in Amazon’s huge network of data centers, it is the first time that Amazon has pre-installed Spark and made it an easy-to-order option on its menu of computing services.
“Although many customers have previously been installing Spark using custom scripts, you can now launch an Amazon EMR cluster with Spark directly from the Amazon EMR Console, CLI [command line interface], or API,” Amazon’s senior product manager Jon Fritz writes in an Amazon Web Services blog post.
Fritz provided several examples of how existing EMR customers have used Spark (configuring it themselves, obviously, instead of using the new shrink-wrapped offering). Among the EMR customers already using Spark are:
- The Washington Post, which is “using Spark to power a recommendation engine to show additional content to their readers”
- Yelp, which uses Spark’s machine learning library (MLlib) to increase the click-through rates of display advertisements
- Hearst Corporation, which uses Spark Streaming “to quickly process clickstream data from over 200 web properties,” allowing them to “create a real-time view of article performance and trending topics”
- And Krux, which uses Spark to process log data stored in Amazon S3 using EMRFS.
Spark is gaining momentum as a faster and easier-to-program replacement for MapReduce within Hadoop environments. While MapReduce was batch-oriented and could take hours or days to return answers, Spark functions as an in-memory framework and can work in batch, interactive, and streaming modes.
Fritz notes two main ways that Spark beats MapReduce. The first involves Spark’s use of a directed acyclic graph (DAG) execution engine, which gives it a more efficient query plan for data transformations. The second is its use of in-memory, fault-tolerant resilient distributed datasets (RDDs), which keep intermediates, inputs, and outputs in memory instead of on disk.
“These two elements of functionality can result in better performance for certain workloads when compared to Hadoop MapReduce, which will force jobs into a sequential map-reduce framework and incurs an I/O cost from writing intermediates out to disk,” Fritz writes. “Spark’s performance enhancements are particularly applicable for iterative workloads, which are common in machine learning and low-latency querying use cases.”
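The two ideas Fritz describes can be illustrated with a small toy model. The sketch below is plain Python, not Spark's API: the `ToyDataset` and `CountingSource` classes and the `count_source_reads` helper are hypothetical names invented for this illustration. Transformations only record a step in a lineage graph, an action triggers the actual work, and an optional in-memory cache lets an iterative job avoid re-reading its input on every pass.

```python
from typing import Callable, Iterable, List, Optional


class CountingSource:
    """Hypothetical input whose load() stands in for an expensive disk read."""

    def __init__(self, values: Iterable[int]):
        self._values = list(values)
        self.reads = 0  # how many times the "disk" was actually read

    def load(self) -> List[int]:
        self.reads += 1
        return list(self._values)


class ToyDataset:
    """Toy stand-in for an RDD: a lazily evaluated node in a lineage graph."""

    def __init__(self, compute: Callable[[], List[int]]):
        self._compute = compute  # recipe for rebuilding this node from its parent
        self._keep_in_memory = False
        self._cached: Optional[List[int]] = None

    def map(self, fn: Callable[[int], int]) -> "ToyDataset":
        # Transformation: records a new node; nothing is computed yet.
        return ToyDataset(lambda: [fn(x) for x in self._materialize()])

    def filter(self, pred: Callable[[int], bool]) -> "ToyDataset":
        # Transformation: also lazy.
        return ToyDataset(lambda: [x for x in self._materialize() if pred(x)])

    def cache(self) -> "ToyDataset":
        # Ask for the result to be kept in memory after its first computation.
        self._keep_in_memory = True
        return self

    def collect(self) -> List[int]:
        # Action: walks the lineage and actually runs the recorded steps.
        return list(self._materialize())

    def _materialize(self) -> List[int]:
        if self._cached is not None:
            return self._cached
        data = self._compute()
        if self._keep_in_memory:
            self._cached = data
        return data


def count_source_reads(cached: bool, iterations: int = 5) -> int:
    """Run a small iterative job and report how often the source was read."""
    source = CountingSource(range(1, 11))
    squares = ToyDataset(source.load).map(lambda x: x * x)
    if cached:
        squares.cache()
    for threshold in range(iterations):
        # Each pass reuses the same intermediate dataset with a new filter.
        squares.filter(lambda x, t=threshold: x > t).collect()
    return source.reads


if __name__ == "__main__":
    print("source reads without cache:", count_source_reads(cached=False))
    print("source reads with cache:   ", count_source_reads(cached=True))
```

In this toy, five passes over the uncached dataset read the source five times, while the cached version reads it once. That is the same shape of saving Fritz attributes to Spark on iterative workloads, though real Spark adds partitioning, scheduling, and recovery from lost partitions that this sketch does not attempt.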
Amazon doesn’t charge for the Spark software, and allows EMR customers to create Spark clusters on a variety of Amazon Elastic Compute Cloud (EC2) instance types. These clusters can access data stored on Amazon’s S3 object storage systems via the EMR File System (EMRFS), push logs to S3, and use EC2 Spot capacity, Fritz writes. The Spark setup also supports security features like identity and access management (IAM) roles, EC2 security groups, and S3 encryption.
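For readers who want to see what launching such a cluster through the API might look like, here is a rough sketch using the `run_job_flow` call in boto3, the AWS SDK for Python. It is not taken from Amazon's announcement: the helper names `build_spark_cluster_request` and `launch`, the release label, the instance type, the bucket layout, and the region are all assumptions chosen for illustration, and the role names are the EMR defaults, which must already exist in the account.

```python
from typing import Any, Dict


def build_spark_cluster_request(
    name: str,
    log_bucket: str,
    release_label: str = "emr-6.15.0",  # assumption: any release that bundles Spark
    instance_type: str = "m5.xlarge",   # assumption: pick a type that suits the job
    core_count: int = 2,
) -> Dict[str, Any]:
    """Build keyword arguments for boto3's EMR run_job_flow call."""
    return {
        "Name": name,
        "LogUri": f"s3://{log_bucket}/emr-logs/",  # cluster logs are pushed to S3
        "ReleaseLabel": release_label,
        "Applications": [{"Name": "Spark"}],       # ask EMR to install Spark
        "Instances": {
            "InstanceGroups": [
                {
                    "Name": "master",
                    "InstanceRole": "MASTER",
                    "Market": "ON_DEMAND",
                    "InstanceType": instance_type,
                    "InstanceCount": 1,
                },
                {
                    "Name": "core",
                    "InstanceRole": "CORE",
                    "Market": "SPOT",              # run workers on EC2 Spot capacity
                    "InstanceType": instance_type,
                    "InstanceCount": core_count,
                },
            ],
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",      # IAM role assumed by the EC2 instances
        "ServiceRole": "EMR_DefaultRole",          # IAM role assumed by the EMR service
    }


def launch(request: Dict[str, Any], region: str = "us-east-1") -> str:
    """Submit the request; needs boto3 installed and AWS credentials configured."""
    import boto3  # third-party; imported here so building a request needs no SDK

    emr = boto3.client("emr", region_name=region)
    return emr.run_job_flow(**request)["JobFlowId"]
```

Building the request separately from submitting it makes it easy to inspect or log the configuration before any billable resources are started. Calling `launch(build_spark_cluster_request("spark-demo", "my-log-bucket"))` would start a real cluster, so the bucket name there is only a placeholder.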
This is a big deal for Amazon, which is by far the biggest provider of Hadoop in the world, with tens of thousands of customers, more than all the other distributors combined. Opening Spark to its massive user base will only increase the adoption of Spark and further cement its emerging role in the big data analytics ecosystem.