June 8, 2016

Apache Spark Adoption by the Numbers

Alex Woodie

(Robolab/Shutterstock)

It’s been about three years since Apache Spark burst onto the big data scene and became one of the hottest technologies on the planet. Judging by the numbers surrounding Spark’s adoption, including salaries, attendance, and committers, the future of Spark looks quite bright.

Wondering how much money Spark developers make? (We know you are!) According to Stack Overflow’s latest developer survey, Spark tied with Scala for the top-paying job in technology, with Spark developers in the U.S. earning an average of $125,000 per year.

It’s interesting that Spark and Scala tied, considering that Spark is largely written in Scala and that Scala remains the most popular language for programming Spark applications. Globally, Spark was the fourth-highest-paying job, behind F#, Dart, and Cassandra.

In terms of sheer popularity, Spark had the second-biggest year-over-year increase (163.5%) in the number of votes among all developers on Stack Overflow. It trailed only React, an up-and-coming JavaScript library created at Facebook and used in Facebook and Instagram, but finished ahead of other big data-related technologies like Cassandra, Python, MongoDB, R, Redis, and Hadoop.

Spark salaries lead among developers in the U.S. (Source: Stack Overflow survey)

You’ve probably heard that Spark is the top open-source big data project, ahead of Apache Hadoop and the rest of the field. In September 2015, Databricks released results from a survey showing that Spark had more than 600 contributors within the past year, nearly double the number from the previous year.

Today, there are over 1,000 contributors to Spark, Databricks executives said at this week’s Spark Summit conference in San Francisco. Databricks says there are 2,500 attendees at the event, which wraps up today. That’s a 5x increase compared to the first Spark Summit held in December 2013.
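The growth figures above can be made concrete with a little arithmetic. The following is a minimal sketch in Python, not anything taken from the surveys themselves. It treats "more than 600" and "over 1,000" as exactly 600 and 1,000, and reads "5x increase" as "five times as many attendees" (both are assumptions). The helper names `growth_pct` and `implied_baseline` are illustrative only.

```python
# Figures quoted in the article, rounded to the stated thresholds (assumption).
contributors_2015 = 600    # "more than 600" in the September 2015 survey
contributors_2016 = 1000   # "over 1,000" at Spark Summit, June 2016
attendees_2016 = 2500      # Spark Summit attendance per Databricks
attendance_multiple = 5    # "5x increase", read here as 5 times as many


def growth_pct(old, new):
    """Percentage increase going from old to new."""
    return (new - old) / old * 100


def implied_baseline(current, multiple):
    """Baseline value if `current` is `multiple` times that baseline."""
    return current / multiple


contributor_growth = growth_pct(contributors_2015, contributors_2016)
first_summit_attendees = implied_baseline(attendees_2016, attendance_multiple)

print(f"Contributor growth: about {contributor_growth:.1f}%")
print(f"Implied attendance at the first summit: about {first_summit_attendees:.0f}")
```

Under those assumptions, the contributor base grew by roughly two-thirds in under a year, and the first Spark Summit would have drawn on the order of 500 people.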

It’s safe to say that, if you’re a big data developer, you’re probably interested in Spark, among a number of other technologies. According to a recent survey by Syncsort, nearly 70% of respondents said they’re most interested in Apache Spark, which surpassed interest in all other compute frameworks, including MapReduce, which had a 55% share of developer interest. This was a clear indication that Spark was moving from a data science project to a production-grade technology, the company said.

Spark is second only to React in terms of the increase in developer votes at Stack Overflow

These findings were backed up by a recent report from Wikibon analyst George Gilbert, who concluded that Spark-based investments will capture 6% of total big data spending, growing to 37% by 2022. Considering the widespread adoption and support that Spark is getting from the big data industry, that prediction would seem to be solid.

While Spark is primarily considered to be a tool for data scientists to use, it appears the technology is reaching past the unicorns to bring mere mortals into the fold.

According to a recent Databricks survey, 60% of the people using Databricks Community Edition (DCE), the company’s Spark-in-the-cloud environment that became generally available yesterday after a four-month beta, are neither data scientists nor data engineers. What’s more, one out of four DCE users had never used Spark before, which shows how the technology is spreading to new users.

How are people using Spark? We got some insight into this question from a survey Databricks conducted in late 2015.

According to the survey, Spark SQL is the most popular component of Spark, with 69% of Spark users reporting that they use it. The second most popular component is DataFrames (62%), followed by Spark Streaming and the combination of MLlib and GraphX, which tied at 58% each.

Spark adoption according to Databricks September 2015 survey

How do developers use Spark? The survey says Scala was the most popular Spark language, with 71% of programmers using that API. Python was the second most popular language, followed by SQL, Java, and R.

Where do you see Spark run? (Run, Spark, run!) According to the Databricks survey, 51% of Spark deployments are on the public cloud. Nearly half of survey respondents (48%) say they run Spark in standalone mode, while 40% say they run it on Hadoop’s resource scheduler, YARN.

Only 11% of users reported running Spark on Mesos, the distributed resource scheduler that forms the heart of BDAS, the Berkeley Data Analytics Stack that, like Spark, came out of Cal’s AMPLab. Matei Zaharia, the creator of Spark and co-founder and CTO of Databricks, also co-created Mesos.
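A quick sanity check shows how the four deployment figures relate to one another. The sketch below simply adds up the percentages quoted above. The variable names are illustrative, and the interpretation in the closing note is an assumption rather than something the survey summary states.

```python
# Shares quoted from the Databricks survey, in percent of respondents.
cluster_manager_share = {"standalone": 48, "YARN": 40, "Mesos": 11}
public_cloud_share = 51

# Sum of the three cluster-manager figures on their own.
manager_total = sum(cluster_manager_share.values())

# Sum if the public cloud figure were treated as a fourth slice of the same pie.
combined_total = manager_total + public_cloud_share

print(f"Cluster managers only: {manager_total}%")
print(f"With public cloud added: {combined_total}%")
```

The three cluster-manager figures add up to just under 100%, while all four together far exceed it. One plausible reading (an assumption) is that "public cloud" describes where a cluster is hosted, while standalone, YARN, and Mesos describe how it is managed, so the 51% overlaps with the other three rather than competing with them.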

Spark consistently has more than 100 contributors per month (Source: BlackDuck)

Spark can no longer be considered the new kid on the big data block. We’ve seen other frameworks, like Apache Flink and Apache Beam, emerge to provide competition to Spark. Some of these other frameworks are quite compelling, and may do certain things better than Spark, according to big data experts and benchmark tests.

But despite the competition, Spark remains the hottest technology in big data. That says a lot about the value that developers, engineers, and data scientists are getting out of the open source software, and it says a lot about where big data tech will go next.
