March 28, 2016

Finding Long-Term Solutions to the Data Scientist Shortage

Alex Woodie

As we learned in the first part of this series, the gap between demand for skilled data scientists and supply is driving salaries north of $200,000 in some areas of the country. If big data analytics is to be democratized, steps must be taken to ensure that this short-term misalignment doesn’t turn into a long-term problem. Here are several ways the data scientist shortage is being addressed.

Perhaps the most straightforward way to approach the data scientist shortage is simply to train more of them. To that end, universities around the country are ramping up data science graduate programs. In just a few short years, dozens of major universities–from Kennesaw State University and USC to NYU and University of Tennessee-Knoxville–have launched two-year PhD- and Master’s-level programs, and interest among college graduates is reported to be high.

But as KSU Professor Jennifer Priestley noted, universities are running into their own unique set of problems as they build these programs. “We’ve talked about the talent shortage in the private sector, but the reality is we’ve got the same challenge in academia,” Priestley tells Datanami. “Most academic universities cannot pay the same salaries as the private sectors, so we are very much in competition for the private sector for the very people who will be closing the gap.”

Since data science PhDs are such a new thing, universities can’t expect to build their data science program by hiring the newly minted data science PhDs. That means universities are asking tenured computer science professors to “pivot” mid-career into data science, which makes many of them uncomfortable. And when KSU and other programs start cranking out the data science PhDs in a year or so–slowly at first, then picking up steam–universities will have a tough time keeping them within academia to help train the next generation of data scientists. “We don’t have the talent to create the talent that the private sector needs,” Priestley says.

Re-Delegating Data Science

Another approach to closing the skills gap is to treat data science as a team sport, and delegate some of the responsibilities that are often ascribed to data scientists to business people who are not data scientists. That’s the approach espoused by EMC’s Bill Schmarzo in his new book “Big Data MBA,” which basically aims to teach business people to think like data scientists.

As Schmarzo explains, the classic three-part Venn diagram of a data scientist that includes math, engineering, and domain knowledge should be redrawn, and domain expertise should be eliminated.

“Data scientists don’t need heavy domain expertise. They just need to know data science,” Schmarzo tells Datanami. “It’s nice to have an understanding of industry. But it’s hard to find data scientists who know the business as well as the business people. It’s kind of an insult to think you’re going to find a data scientist who understands mortgage risk underwriting better than the people who are running the business.”

Schmarzo wrote “Big Data MBA” as a textbook to use in the class he teaches at UCSF. But he envisions the principles being adopted outside the classroom, including big data engagements at EMC and the two-week “vision” workshops he puts on.

“First off, we explain to clients what data science is. Really simple: data science is identifying those variables and metrics that might be better predictors of performance,” he says. “Business people understand the business and variables and metrics and other sources they’d like to test out. We allow them to brainstorm that, then the data scientists come in and tell you what variables are better predictors.”

There will always be a need for classically trained data scientists who are experts in math and statistics, and who can build and run predictive models that are accurate and work. But according to Schmarzo, businesses can accelerate their big data projects if they include regular business people in the process, and stop holding their breath waiting for a rare data scientist “unicorn” to come floating in out of the ether.

“I think that helps us to address that skills gap problem,” Schmarzo says. “The tricky part is getting business people to think like data scientists, which in many ways is envisioning the realm of what’s possible. They think they only get reports and dashboards. They don’t know they can bring in Zillow data to predict the value of a customer, or they can bring in building permit data to look at how traffic will be impacted. They’ve not ever thought about that before.”

Chew Your Own Food

Schmarzo’s bottom-up approach holds a lot of promise. If we were all thinking predictively instead of descriptively, we’d open ourselves up to many more potential uses for data science. Just finding what is possible with data science is often the hardest part in big data.

But some data scientists caution against breaking up the classic Venn diagram too much, particularly when it comes to engineering chops. “A lot of companies now try to take the analyst and put an engineer next to them and hope that it adds up,” says Chris McKinlay, a senior data scientist at the Los Angeles-based data science consultancy Data Science. “Sometimes it does, but sometimes it doesn’t very well.”

Data scientists often spend up to 80 percent of their time doing the grunt work of ingesting, cleaning, and transforming the data for analysis. It’s tempting to have data engineers do some of that work, but that approach could backfire.

“We have a client with a very large data set and the data scientists would say ‘Carve it down to size, then we’ll look at it,’ and the data engineers would go away for a week and do that,” McKinlay says. “Then the data scientists would come back and say ‘This doesn’t quite have everything I wanted’ or ‘Why is the feature aggregated this way?’ So it goes back over the fence. It goes so much faster now that data scientists can actually use Spark and figure out for themselves what they want.”

For many types of problems, there is no way to replicate a data scientist and the combination of skills, experience, and insight that he or she brings to the job. “You can be classically trained in statistics or machine learning and have great insight, but without the ability to implement, you’re completely reliant on a whole team of people to sort of chew your food for you,” McKinlay says. “Conversely, if you’re able to chew your own food, you may not be able to implement the most detailed machine learning model. But you take something fairly simple and get 70 percent of the information out of the data fairly quickly. Often in machine learning, simple is better. Until there’s a solid reason to make it more complex, simple is good.”

Data Science hopes to tackle the data scientist shortage by training them in 12-week “bootcamp” style classes. The company’s DS12 class, which starts in June, aims to turn people who have knowledge of math and statistics and programming skills into data scientists who can tackle private industry’s toughest problems. There’s a big focus on using Spark to prepare and analyze data, in particular via Scala. “The thing we’re doing that no one else is doing, is we’re actually throwing you at the kinds of problems you have to solve in weeks, not just toy problems or problems that will fit on your computer,” McKinlay adds.

It’s Elementary

If big data analytics is here for the long term, and there’s every reason to think that it is, then we need to ramp up data science education in a big way. That means we should all be thinking predictively, as the Dean of Big Data Bill Schmarzo encourages. But it also means taking steps now to prepare children to be the next generation of data scientists and business people with data science awareness.

(CristinaMuraca/Shutterstock)


There’s one simple change that the American educational system could make to address this: teach programming.

“It could be something they learn–JavaScript or Basic–or even Excel. Just exploring the idea of constructing logical programs that make the computer do something,” says Travis Oliphant, a data scientist and the CEO of Continuum Analytics. “It doesn’t have to be tons, but just a little bit. Everybody coming out of high school should have that.”

Oliphant, who used to teach at a university, also thinks linear algebra and probability theory should be taught in high school. Instead of exposing young minds to those ideas, kids are conditioned to be scared of math. “I’d actually be driving education in the schools differently, because you really have to start the pipeline,” he says. “I was an academic for years. I taught at university for a long time. I see the gaps. I know what they’re learning, and I know it’s not quite right. It needs to be more practical.”

KSU Professor Priestley echoes that sentiment. “What I tell my students, any student that wanders by my office and doesn’t know what to do with themselves, I tell them study math or computer science, learn how to program in something,” she says. “I get you don’t want to be a computer programmer for the rest of your life. But until you figure out what you want to do, you’ll never go hungry if you know Python, you’ll never go hungry if you know SAS.”

Related Items:

Tracking the Data Science Talent Gap
