There’s good news if you’re looking for a job in data science in 2016: the number of job openings in the field appears to be rising as companies look to leverage big data for competitive advantage. But actually landing a coveted data science job means having the right mix of skills, and you may be surprised to learn which skills are most in demand by employers.
The folks at CrowdFlower recently did an analysis of the 3,490 postings for data science jobs on LinkedIn, and sorted out the top 21 individual skills that appear most often. Some of the results were not earth-shattering — SQL topped the list, to nobody’s great surprise — while other results could be leading indicators on how the data science field is evolving.
As mentioned, SQL was the most commonly cited skill, and is a requirement in 57 percent of all LinkedIn job postings for data science. Hadoop came in at number two, with a solid 49 percent rating. This did not surprise Lukas Biewald, CEO and founder of CrowdFlower, the San Francisco company whose workers create high-quality reference data that data scientists use to train analytic models.
“It’s not surprising the two skills that are at the top are SQL and Hadoop, which are the technologies that actually store the data,” Biewald tells Datanami. “Every data scientist has to know how to get the data out. If you can’t get the data out in the first place, you can’t do anything.”
Some eyebrows were raised at Python being the third-most cited skill in data science job postings. When CrowdFlower surveyed data scientists last year about which skills are most important, Python played second fiddle to R. But in this survey of job postings (which is arguably more forward-looking in its scope), Python was cited as a critical data science skill in 39 percent of the listings, compared to 32 percent for R.
Biewald has some ideas about why more employers are looking for data scientists with Python skills than with R skills. “One thing is Python’s toolset is getting better. You see a lot more statistical tools built on Python,” he says. “It’s also a recognition that data science isn’t just statistics.”
If data scientists spend 80 percent of their time cleaning and prepping data, and only 20 percent of their time actually doing analysis, then that might explain Python’s sudden emergence.
“I think Python is the language of cleaning data and R is the language of doing analysis,” says Biewald, who led the Search Relevance Team at Yahoo before co-founding CrowdFlower. “As data science becomes more about cleaning, prepping and enriching data, you see Python becoming more and more important, because it’s definitely the best language for getting your [data] into a form where you can do the analysis.”
The fact that Java came in fourth place (with a 37 percent rating) was a little bit of a headscratcher, because Java, per se, is not necessarily a great language for data science. But its high placing does make sense when you consider that Hadoop was written in Java. Other Hadoop-related tools cracking the top 10 included Hive (31 percent), MapReduce (22 percent), and Pig (16 percent).
There were some notable omissions from the LinkedIn job-posting list compiled by CrowdFlower. Apache Spark, for all of its data science capability, was nowhere to be found. Neither was Scala, which (along with Python) is one of the primary ways people manipulate data within the Spark framework.
It could be that Spark is still too green, and too little is known about Spark, to make it on the data science job description sheet. “There’s a lot of hype around it, but it might be too early,” Biewald says. “We’ve been experimenting with Spark at CrowdFlower. I think the technology is great, but there might be a lag before companies really start using it.”
Spark and Scala (which is heavily backed by Alphabet [NASDAQ: GOOGL] and is widely used at many of the high-tech darlings in Silicon Valley) may well be the future of data science. But not every data science project or team needs to be on the bleeding edge of technology to enable their big data to bear fruit. “It’s amazing how many people are looking for data scientists right now [but] I think a lot of them don’t want to be on the cutting edge,” Biewald says.
The CrowdFlower list was populated with a number of well-known analytics tools, including SAS (appearing in 16 percent of listings), SPSS (10 percent), MATLAB (10 percent), and Stata (3 percent). Biewald thinks these tools still have value and will continue to be used in the field for some time. But he expects them to gradually lose market share to newer tools that were designed from the outset with big data in mind.
“The data science role is bigger than being a statistician,” he says. “These older languages are built with more of a statistician in mind, with a small number of data points to run analysis, whereas Hadoop and Python and Java–all the ones at the top [of the list]–are about running gigabytes or terabytes of data. You can get [SAS, SPSS, and MATLAB] to run big analyses. But it’s not what they were designed to do.”
Not everybody agrees on the definition of “data science” or on what “data scientists” should do and what skills they should have. In fact, some object to the use of the term “science” at all, preferring instead phrases like “applied statistics.” (Imagine the Harvard Business Review calling “applied statistician” “the sexiest job of the 21st century.”)
But in the eyes of Biewald and others, the ability of a person to manage the data is just as important as the ability to run statistics accurately against a data set, and that’s his definition of data scientists going forward.
“In the past that wasn’t super hard [to do statistics] when it was thousands of records, but now that it’s billions of records, that takes real skill to get it in a format where you can do some regression or machine learning,” he says. “For that, I would want to hire a data scientist who has Python or C or Perl or Ruby or some language that’s made more for data processing rather than data analysis.”