June 22, 2015

Apache Spark: Now Offered on Amazon EMR

Alex Woodie

The number of places you can run Apache Spark increases by the week, and last week hosting giant Amazon Web Services announced that it’s now offering Apache Spark on its hosted Hadoop environment.

The addition of Spark gives Amazon Elastic MapReduce (EMR) customers another big data processing engine to run alongside those already available, including Hive, Pig, HBase, Presto, and Impala, among others.

While this is not the first time Apache Spark has graced a computer running in Amazon’s huge network of data centers, it is the first time that Amazon has pre-installed Spark and made it an easy-to-order option on its menu of computing services.

“Although many customers have previously been installing Spark using custom scripts, you can now launch an Amazon EMR cluster with Spark directly from the Amazon EMR Console, CLI [command line interface], or API,” Amazon’s senior product manager Jon Fritz writes in an Amazon Web Services blog post.

Fritz provided several examples of how existing EMR customers have used Spark (configuring it themselves, obviously, instead of using the new shrink-wrapped offering). Among the EMR customers already using Spark are:

The Washington Post, which is “using Spark to power a recommendation engine to show additional content to their readers”
Yelp, which uses Spark’s machine learning library (MLlib) to increase the click-through rates of display advertisements
Hearst Corporation, which uses Spark Streaming “to quickly process clickstream data from over 200 web properties,” allowing them to “create a real-time view of article performance and trending topics”
And Krux, which uses Spark to process log data stored in Amazon S3 using EMRFS.

Spark is gaining momentum as a faster and easier-to-program replacement for MapReduce within Hadoop environments. While MapReduce is batch-oriented and can take hours or days to return answers, Spark functions as an in-memory framework and can work in batch, interactive, and streaming modes.

Fritz notes two main ways that Spark beats MapReduce. The first is Spark’s directed acyclic graph (DAG) execution engine, which gives it a more efficient query plan for data transformations. The second is its use of in-memory, fault-tolerant resilient distributed datasets (RDDs), which keep intermediates, inputs, and outputs in memory instead of on disk.

“These two elements of functionality can result in better performance for certain workloads when compared to Hadoop MapReduce, which will force jobs into a sequential map-reduce framework and incurs an I/O cost from writing intermediates out to disk,” Fritz writes. “Spark’s performance enhancements are particularly applicable for iterative workloads, which are common in machine learning and low-latency querying use cases.”
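To make the point about iterative workloads concrete, here is a minimal pure-Python sketch. It is not Spark code. The `ToyDataset` and `CountingSource` classes and the `run_iterations` and `demo` helpers are hypothetical names invented for this illustration. They only mimic two ideas from the passage above: transformations are chained lazily into a single plan, and an intermediate result can optionally be held in memory so that repeated passes do not go back to storage.

```python
class CountingSource:
    """Simulated storage: counts how many times the data is read in full."""

    def __init__(self, rows):
        self.rows = list(rows)
        self.reads = 0

    def __call__(self):
        self.reads += 1
        return iter(self.rows)


class ToyDataset:
    """Toy stand-in for an RDD (hypothetical, for illustration only)."""

    def __init__(self, compute):
        self._compute = compute    # zero-argument function returning an iterator
        self._use_cache = False
        self._cached = None        # materialized rows, filled on first use

    def map(self, fn):
        parent = self
        return ToyDataset(lambda: (fn(x) for x in parent._iterate()))

    def filter(self, pred):
        parent = self
        return ToyDataset(lambda: (x for x in parent._iterate() if pred(x)))

    def cache(self):
        """Ask for this dataset to be kept in memory after its first computation."""
        self._use_cache = True
        return self

    def _iterate(self):
        if self._use_cache:
            if self._cached is None:
                self._cached = list(self._compute())
            return iter(self._cached)
        return self._compute()

    def sum(self):
        # Nothing is computed until an action such as sum() is called.
        return sum(self._iterate())


def run_iterations(dataset, n):
    """Run n passes over the same dataset, as an iterative algorithm would."""
    return [dataset.map(lambda x, k=k: x * k).sum() for k in range(1, n + 1)]


def demo():
    is_even = lambda x: x % 2 == 0
    square = lambda x: x * x

    # Without caching, every pass re-reads the source and redoes filter + map.
    plain_source = CountingSource(range(1, 11))
    plain = ToyDataset(plain_source).filter(is_even).map(square)
    plain_results = run_iterations(plain, 3)

    # With caching, the filtered and squared rows are computed once and reused.
    cached_source = CountingSource(range(1, 11))
    cached = ToyDataset(cached_source).filter(is_even).map(square).cache()
    cached_results = run_iterations(cached, 3)

    return plain_results, plain_source.reads, cached_results, cached_source.reads


if __name__ == "__main__":
    print(demo())
```

In this toy, three passes over the uncached dataset read the simulated source three times, while the cached version reads it once and returns the same results. Real Spark and MapReduce differ in many details (MapReduce writes intermediates to disk between jobs rather than recomputing them, and Spark adds partitioning and fault tolerance), so treat this only as an illustration of why repeated passes over the same data benefit from keeping it in memory.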

Amazon doesn’t charge for the Spark software, and allows EMR customers to create Spark clusters on a variety of Amazon Elastic Compute Cloud (EC2) instance types. These clusters can access data stored on Amazon’s S3 object storage systems via the EMR File System (EMRFS), push logs to S3, and use EC2 Spot capacity, Fritz writes. The Spark setup also supports security features like identity and access management (IAM) roles, EC2 security groups, and S3 encryption.

This is a big deal for Amazon, which is by far the biggest provider of Hadoop in the world, with tens of thousands of customers, more than all the other distributors combined. Opening Spark to its massive user base will only increase the adoption of Spark and further cement its emerging role in the big data analytics ecosystem.
