July 21, 2017

Taking the Data Scientist Out of Data Science

Alex Woodie

(Sergey Nivens/Shutterstock)

If you were a data scientist three years ago, you could pretty much write your own ticket. Everybody in the industry, it seemed, either wanted to hire a data scientist, or wanted to be one. But today, thanks to a confluence of factors, organizations are beginning to question whether they need these digital unicorns at all.

The key to understanding the dynamic at play here is to separate the activity of “data science” from the persona of “data scientists.” Organizations most definitely want to do data science to get insight from their data. But they’re finding that they don’t necessarily need a classically trained data scientist to get there.

The folks behind the MATLAB analytics platform have been watching this dynamic unfold within their customer base, which leans heavily toward fields like product engineering, manufacturing, and life sciences.

According to Seth DeLand, product marketing manager for the data analytics group at MathWorks, giving these domain experts data science training often pays higher dividends than bringing a data scientist up to speed on the actual domain they’ll be working in.

Putting data science tools in the hands of domain experts could yield significant data science dividends (Likoper/Shutterstock)

“You educate [these domain experts] about statistics and machine learning and these types of techniques, and they can hit the ground running,” DeLand tells Datanami. “In fact, if you put a data scientist in those types of situations, the fact that they don’t have the domain expertise is often very costly upfront.”

The extra time and money required to train a data scientist in the specific domain may not be worth it, DeLand says. “Oftentimes it ends up being more efficient to put the tools of the data scientist in the hands of that domain expert,” he says.

A MATLAB user with a background in electrical engineering or mechanical engineering will often pick up statistics and machine learning concepts more quickly than a data scientist can get up to speed on a particular domain, DeLand says.

“The leap from that type of a person to go from doing their traditional work to applying data science tool is less of a leap than to go from having no experience in that area at all,” he says. “That’s the direction we see. These people have some skills set and have that domain expertise, which is really key.”

Low-Hanging Fruit

The high level of hype surrounding the big data phenomenon over the past few years has driven many organizations to hire data scientists when they really didn’t need them, says Datameer CEO Stefan Groschupf.

“It’s very clear that a lot of companies realized that this hype of bringing in data science teams and building huge data science teams is not the solution for every problem,” Groschupf says in a recent video. “There are so many insights [you can get by] just bringing different streams of data together in the organization, and you don’t need a data scientist for [that].”

There’s plenty of low-hanging big data fruit to pick before the skills of a data scientist are required (retrorocket/Shutterstock)

There are data questions that highly trained data scientists are best equipped to answer. But there are many more questions that do not justify the considerable investment that it will take to assemble a full team of data scientists, Groschupf says.

“There’s so much value in just going after low hanging fruits, and to be honest, that’s the quantity of problems that companies see in front of them around data and analytics,” he says. “We saw that everybody was very excited and data scientists [were going] to solve every problem, including printing money and making the dishes. But I think more and more companies are understanding that there are very specific use cases where data science teams are incredibly valuable, but not everything needs a data scientist.”

Better Data Science Training

The wider availability of data science training and education may also be contributing to the declining need for classically trained data scientists.

There is a wide variety of affordable online courses from companies like Udemy, Udacity, Coursera, and DataCamp that people can take to begin their machine learning education. You can get all the material in Andrew Ng’s 11-week machine learning class for free, but you’ll have to pony up some dough if you want a certificate that shows you completed it.

These courses can help folks who will never become data scientists learn some data science skills to help them do better in their jobs. This dynamic of a wider pool of workers with data science skills jibes with a recent report commissioned by IBM that finds a big shortage of people with data science and analytics (DSA) skills.

The “Quant Crunch” report found that, while data scientists and advanced data analyst positions will remain in high demand, there will also be demand for folks with titles like credit analyst, GIS specialist, or marketing analytics manager to have DSA skills. By 2020, the most in-demand job title will be “data-driven decision makers,” the report found.

Better Data Science Software

Increasingly, software is doing the jobs that we used to rely on data scientists to do. That’s the case at MathWorks, which is building into MATLAB new features for helping users with selecting machine learning algorithms, performing the feature engineering to select which variables will be used, and fine-tuning the algorithm for production use.

Data science software is lessening the need for data scientists

“One thing we’ve been focused on lately,” DeLand says, “is creating point-and-click applications in MATLAB that allow users to go through that workflow of trying a bunch of these machine learning models, preparing them and testing them, and really doing that first pass to figure out what’s the right ballpark I should be in for fitting a machine learning model to my data.”

Better software can also help big data practitioners avoid the sorts of statistical errors that previously required a highly trained data scientist to spot.

“One really traditional problem in machine learning is overfitting,” DeLand says. “If you train the machine model and overfit it to the data you have, it’s just going to do a really poor job if you put it into production. Forcing people to opt into different types of data validation techniques that help protect against overfitting is something that we’ve built into our software.”
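The kind of validation DeLand describes can be illustrated with a short sketch. This one is written in plain Python rather than MATLAB, and it is not MathWorks' implementation. The synthetic data, the helper names (fit_line, fit_memorizer, k_fold_mse), and the choice of five folds are all assumptions made for illustration. It compares a simple straight-line fit with a model that memorizes its training data, scoring each on the data it was trained on and on held-out folds.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def fit_memorizer(xs, ys):
    """1-nearest-neighbour: echoes the label of the closest training point."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


def mse(model, xs, ys):
    """Mean squared error of a fitted model on the given points."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def k_fold_mse(fit, xs, ys, k=5):
    """Average error on held-out folds; every point is held out exactly once."""
    fold_errors = []
    for fold in range(k):
        train_x = [x for i, x in enumerate(xs) if i % k != fold]
        train_y = [y for i, y in enumerate(ys) if i % k != fold]
        test_x = xs[fold::k]
        test_y = ys[fold::k]
        model = fit(train_x, train_y)
        fold_errors.append(mse(model, test_x, test_y))
    return sum(fold_errors) / k


# Hypothetical data: a straight line plus Gaussian noise, fixed seed.
rng = random.Random(0)
xs = [i / 30 for i in range(300)]
ys = [2 * x + 1 + rng.gauss(0, 1) for x in xs]

memo_train = mse(fit_memorizer(xs, ys), xs, ys)
memo_cv = k_fold_mse(fit_memorizer, xs, ys)
line_train = mse(fit_line(xs, ys), xs, ys)
line_cv = k_fold_mse(fit_line, xs, ys)

print(f"memorizer  train={memo_train:.3f}  held-out={memo_cv:.3f}")
print(f"line       train={line_train:.3f}  held-out={line_cv:.3f}")
```

The memorizing model looks perfect on the data it was trained on, because each training point is its own nearest neighbor. Only the held-out score shows that it has fit the noise, which is why a tool that requires a validation step can catch the problem before a model reaches production.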

The advent of deep learning techniques is a mixed bag when it comes to its impact on demand for data scientists, according to DeLand. While deep learning eliminates the need to perform feature engineering, which is one of the major skills a data scientist would be called on for classic machine learning, the complexity is shifted to designing the neural network architectures.

“In deep learning, you don’t really have to worry about that feature engineering step because it’s all kind of absorbed into the model itself,” he says. “But what you end up spending your time in deep learning is really tuning the model architecture, so in terms of a time-savings, it’s really kind of a tradeoff.”

Data Science Gap

While Web giants like Google and Facebook have the resources to hire tens of thousands of data scientists and engineers to solve big data problems with machine learning and AI, the majority of Fortune 2000 companies are left out in the cold, says Ali Ghodsi, the CEO and co-founder of Databricks, the cloud computing company behind Apache Spark.

“There’s a huge gap,” says Ghodsi, who’s also an adjunct professor in the computer science department at UC Berkeley. “You don’t have those experts. You don’t have those Ph.D.s. You have domain experts.”

A looming ‘AI gap’ threatens to derail data science projects at Fortune 2000 firms, says Databricks’ Ghodsi

Ghodsi sees the potential for data science platforms, such as the cloud-based environment built by Databricks, to help companies equip domain experts with data science expertise, and fill in that gap.

“If they want to go hire 20,000 data scientists and have them stitch all the open source tools together and build the extra ones that are needed and build some extra glue and get it all to fit end to end? They could,” he tells Datanami. “But how long is it going to take for them to do that, versus just buying [a data science platform] and focusing on solving the actual problem they want to solve, which are domain prediction problems?

“They should focus on those problems,” Ghodsi continues, “and not worry so much about downloading different software and stitching it together, and building more software and maintaining it, and just having more engineers doing that for you all day.”

For years we’ve been told that data scientists are required to do high-level big data analytics. But thanks to better software, better data science education, and the capability to cross-train domain experts with data science skills, it appears organizations have a clear path to progress their data science agendas without actual data scientists on board.
