November 4, 2015

Skip the Ph.D and Learn Spark, Data Science Salary Survey Says

Prospective data scientists can boost their salary more by learning Apache Spark and its tied-at-the-hip language Scala than by obtaining a Ph.D., a recent data science survey by O’Reilly suggests.

In its 2015 Data Science Salary Survey, O’Reilly found strong correlations between those who used Apache Spark and Scala, and those who were paid more money. In one of its models, using Spark added more than $11,000 to the median salary, while Scala had about a $4,000 impact to the bottom line.

“It is no surprise that Spark is the tool with the greatest coefficient,” O’Reilly says in its report. “If we indulge in a possible violation of assuming cause and effect, learning Spark could apparently have more of an impact on salary than getting a PhD. Scala is another bonus: those who use both are expected to earn over $15,000 more than an otherwise equivalent data professional.”

Spark and Scala are tied at the hip, apparently. “It appears that in the data space, despite its suitability for a variety of applications, the Scala language has become inextricable with Spark,” O’Reilly says. “In comparison, while Java remains in the open source cluster with Python and Spark, its usage declined from 2014 according to the survey data.”

O’Reilly compiled very detailed job data from more than 600 respondents. The median annual base salary among its survey respondents was $91,000, while for US respondents it jumped to $104,000 (the survey was global). That is not significantly higher than last year, the company says.

But O’Reilly went much deeper than this, and built several models (using the affinity propagation algorithm in Scikit-Learn) that allow you to slice and dice how occupational variables impact salaries earned by people in the data science business (not all of whom have the title “data scientist,” of course).

Fig 1: The standard distribution of salaries of data scientists (courtesy O’Reilly)

Starting with a base salary of $70,577, O’Reilly figured out how the various characteristics–such as age, gender, education, company size, industry, geographic location, technical skills, daily tasks, and tool usage–impact the end salary.

Some of the findings were not surprising. For example, each year of age past 18 adds nearly $1,500 to the salary. Working at a larger company adds $400 to your salary, while being male (add $8,026), working in California (add $16,000) or the Northeast (add $12,000), and having a Ph.D. (add $7,500) also work in your favor (or not).
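To make the arithmetic concrete, here is a minimal sketch of how an additive model like this produces an estimate. It assumes the coefficients simply sum on top of the base salary, and it uses the rounded figures quoted in this article rather than the exact values in the O’Reilly report. The function name `estimate_salary` and the flag names are hypothetical, chosen only for illustration.

```python
# Approximate coefficients as quoted in this article (US dollars per year).
# These are rounded figures, not the exact values from the O'Reilly report.
BASE_SALARY = 70_577
PER_YEAR_OF_AGE_PAST_18 = 1_500  # "nearly $1,500"

FLAG_COEFFICIENTS = {
    "larger_company": 400,
    "male": 8_026,
    "california": 16_000,
    "northeast": 12_000,
    "phd": 7_500,
    "spark": 11_000,  # "more than $11,000"
    "scala": 4_000,   # "about $4,000"
}


def estimate_salary(age, flags=()):
    """Sum the base salary, the age term, and one coefficient per flag."""
    unknown = set(flags) - set(FLAG_COEFFICIENTS)
    if unknown:
        raise ValueError(f"unknown flags: {sorted(unknown)}")
    # Only years past 18 count toward the age term.
    age_term = PER_YEAR_OF_AGE_PAST_18 * max(age - 18, 0)
    # set() ensures a flag listed twice is only counted once.
    return BASE_SALARY + age_term + sum(FLAG_COEFFICIENTS[f] for f in set(flags))


if __name__ == "__main__":
    # Compare a Ph.D. holder with a Spark and Scala user of the same age.
    print("Ph.D.:", estimate_salary(30, ["phd"]))
    print("Spark + Scala:", estimate_salary(30, ["spark", "scala"]))
```

With these rounded inputs, the Spark and Scala user comes out $7,500 ahead of the Ph.D. holder at the same age, which is the comparison behind the headline. As O’Reilly itself cautions, these are correlations, so the sum describes the survey sample rather than a guaranteed raise.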

The survey pointed to some interesting trends in the data. Besides being a man, having a degree, or living in a particular place, it turns out that what you make is tied closely to what tools and technology you know and use, and how much time you spend doing various tasks.

Some of the findings are non-intuitive, which is often the best kind. For example, if you spend one to four hours per week on basic exploratory analysis, you can expect your salary to be about $4,600 higher than if you didn’t. However, if you spend half of your time (four or more hours per day) doing basic exploratory analysis, you can expect your salary to drop by $6,609.

You may loathe data cleansing and ETL work, but according to O’Reilly, if you spend more than four hours per day on ETL, you can expect a salary of around $123,000 per year. (Note to self: remind the boss how much you love ETL.) Similarly, those who spend more than four hours per day in meetings can expect to have a median salary that is more than $11,000 higher than those who don’t. (Self: more meetings, please.)

Other winners in the tech/tool category include the open source visualization library D3, which boosted the median salary by almost $8,000. Apache Hadoop pulled a $1,400 positive impact. Having a job that requires Visual Basic skills actually hurts your median salary by about $3,200.

As you can see in figure 2, there are several other technologies and tools that will give your salary a positive pop. Proficiency in Amazon’s Elastic MapReduce (EMR) correlated with a median salary of greater than $110,000, just below what putting Teradata (NYSE: TDC) on your resume can do.

Fig 2: Salary ranges tied to data science tools/technologies (image courtesy O’Reilly)

Other tools or technologies leading the class (i.e., those drawing a median salary that’s roughly in the upper 50th percentile in the report) include open source tools like Hadoop, MongoDB, Cassandra, HBase, Storm, Mahout, and Pig; Hadoop platform providers like Hortonworks (NASDAQ: HDP), MapR Technologies, and Cloudera; established data players like SAP (NYSE: SAP) HANA, Teradata Aster, Netezza, Cognos, Greenplum, and Splunk (NASDAQ: SPLK); BI tools like QlikView, MicroStrategy, and Pentaho; graph analytic tools like Neo4j and GraphLab (Dato); odd-sounding open source projects like Vowpal Wabbit; advanced analytic tools like KNIME and Mathematica (Matlab); and university projects like LIBSVM.

Here’s one big surprise: The language R didn’t fare particularly well, which was definitely not expected. The percentage of survey respondents who report using R fell from 57 percent in the 2014 survey to 52 percent this year. Spark, by comparison, grew 17 percent, while Scala grew 10 percent. (For what it’s worth, Hadoop’s use dropped from 19 percent to 13 percent, while Java dropped from 32 percent to 23 percent.)
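A year-over-year drop in usage share can be read two ways: as a change in percentage points, or as a change relative to the starting share. The short sketch below computes both for the three tools whose 2014 and 2015 shares are quoted above. The helper name `usage_change` is hypothetical, and the Spark and Scala growth figures are left out because the article does not give their starting shares.

```python
# (2014 share, 2015 share) in percent of respondents, as quoted in the article.
USAGE = {
    "R": (57, 52),
    "Hadoop": (19, 13),
    "Java": (32, 23),
}


def usage_change(before, after):
    """Return (percentage-point change, relative change in percent)."""
    points = after - before
    relative = 100.0 * points / before
    return points, relative


if __name__ == "__main__":
    for tool, (before, after) in USAGE.items():
        points, relative = usage_change(before, after)
        print(f"{tool}: {points:+d} points, {relative:+.1f}% relative")
```

Read this way, R’s five-point slip is a relative decline of under 9 percent, while the Hadoop and Java drops are each close to 30 percent of their 2014 share.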

While R remains one of the four primary tools used in data science (along with SQL, Excel, and Python), the usage pattern around R is changing, O’Reilly says.

“R is a prime example of a tool that is bridging the divide between open source and proprietary tools,” the report says. We’re seeing more interest in R paid by big software companies, including Microsoft, which bought Revolution Analytics, and Teradata, which just added support for R in its eponymous data warehousing environment.

This is reflective of changing times for R. “[T]he open-source-only crowd might be finding they don’t need such a large selection of tools, that Spark and Python do the job just fine,” O’Reilly says. “The large number of R packages has often been cited as a key advantage of R over tools such as Python, but this is not the kind of advantage that is guaranteed to last: there is no reason why developers of other open source tools can’t gradually build on their own libraries to catch up.”

You can download your copy of the O’Reilly 2015 Data Science Salary Survey here.
