How To Build a Data Science Team Now
Business execs who are leading their companies down the data science track may be dismayed by the difficulty and expense of hiring data scientists, the so-called “unicorns” who command quarter-million-dollar salaries. But fear not: While companies can benefit from having a full-fledged data scientist on staff, it is by no means a requirement for actually doing data science.
The team approach to data science started soon after Harvard Business Review named data scientist the “sexiest job of the 21st century” back in 2012, spurring a run on data scientists, applied mathematicians, and other quantitative types that still hasn’t let up. Thanks to the continued rapid evolution of technology – not to mention workplace workarounds put into place due to the aforementioned unicorn shortage – the team approach has grown in popularity.
One business leader with real-world experience putting together data science teams (with and without actual data scientists) is Amy O’Connor, who built Nokia’s first data lake and is currently Cloudera’s Chief Data and Information Officer.
O’Connor has a fascinating job that spans multiple roles. She is responsible not only for Cloudera’s internal IT systems, many of which are based on SaaS applications from vendors like Salesforce.com that are also Cloudera customers, but also for the company’s own big data activities, which include collecting and crunching a range of data to maximize customer satisfaction and revenue and to minimize risk. Those activities, naturally, run on Cloudera’s own Hadoop-based data management and data science products.
While O’Connor has actual Cloudera staff data scientists available to her – not to mention the smart folks inside its Fast Forward Labs subsidiary – she doesn’t believe that data scientists are actually necessary in all circumstances. In fact, judging from the scarcity of data scientists, O’Connor says another path is often preferable for many of the companies she deals with.
“One of the things I’ve found in my own experience at Nokia and Cloudera and with most of our customers is it helps to break down the role of a data scientist into smaller roles,” O’Connor tells Datanami in a recent interview. “I’ve found that that’s a way to take a set of individual people with skills set that complement each other and put them together to create what I call a hybrid data scientist.”
Instead of focusing on finding the magical unicorn data scientist who’s proficient across the triad of required skills – uber math and stats expertise, top-notch business acumen, and killer computer skills – businesses can actually get better results through the team approach. To pull it off, companies still need to find engineers who are good at programming distributed systems. They still need to find somebody with statistical skills, preferably at the Ph.D. level. And they still need somebody who’s a subject matter expert, preferably from the operations side of the house.
But provided that there’s enough overlap among them – that is, the statistician knows a bit of programming, and the engineer is passingly familiar with statistics – then the end result can be greater than the sum of its parts.
“That allows you almost to take that mythical unicorn of the data scientist, break it down into those three different roles, then strengthen the components of the roles that are needed to create a really good data science team,” O’Connor says. “We do this inside Cloudera. I’m finding that most of our customers are starting to do this as well.”
In an ideal world, there would be enough data scientists to satisfy demand. But thanks to the explosive growth of big data – as well as the new technology and business models that accompany it – there simply aren’t enough to go around.
However, as technology improves, it lessens the need for classically trained data scientists. According to Ben Lorica, the chief data scientist at O’Reilly Media, companies that are just getting started with data science can often work around the data scientist requirement by hiring a machine learning engineer.
“I think you can get started and get going by having a bunch of people who have the machine learning engineer title,” Lorica tells Datanami. “So they’re not quite data scientists. They’re more focused in getting machine learning out the door.”
Just as companies must build systems to collect and store data before they analyze it and implement the insights that flow from it, there’s an order to the hiring of the workers who build such systems, Lorica says. “For many organizations, the data engineer precedes the data scientist,” he says.
As their data science maturity increases, companies will want to revisit their staffing ratios. “Ultimately, you will need a team,” Lorica says. “Because there are so many opportunities to apply machine learning in a typical enterprise, you will need probably a team that has the skills to not only build and deploy these things, but also to interact with your business unit.”
There are many ways to approach staffing for data science and machine learning projects. Each individual person can play one of a number of roles, and companies can assemble different combinations of experts in different ratios. But ultimately, the team approach will be required – at least until the technology evolves considerably, Lorica says.
“Until we get to the point where data scientists can actually touch these production systems and push models out to production, build robust data pipelines, clean data, and have tools to share with each other,” Lorica says. “There are companies that are trying to start to build tools like that, where not only can they share data, but features.”
Citizen Data Scientists
As companies try to keep their personnel mix up to speed with the demands of big data initiatives and rapidly changing technology, we’re seeing the specter of another role emerging: citizen data scientist.
That’s the title that Trifacta and DataRobot are using to describe the folks who are using their respective products – Trifacta for data preparation and DataRobot for automating the machine learning lifecycle – which the two companies are increasingly selling jointly.
According to Jen Underwood, DataRobot’s director of product marketing, more than half of the people who interact with DataRobot are business intelligence analysts and BI professionals. “I was shocked by that,” Underwood says.
“It’s a spectrum of analytics roles,” she continues. “You have the BI pros, the business analysts, and then you have the really savvy folks from both of those segments now dabbling with citizen data science. I don’t know if they want to be called that, but that’s what we’re seeing.”
Wei Zheng, vice president of products for Trifacta, says the data preparation work that used to consume up to 80% of data scientists’ time is now being done in a self-service manner by a range of folks, from data engineers to IT engineers to software developers.
“End to end, both products are aiming at trying to make this super simple and easy and lowering the tech skill bar to be able to take advantage of data,” Zheng says. “As time goes on, the industry is maturing and users maturing. I know in this industry, we like to bucket people into data scientist, data analyst, business analyst, data engineer buckets, etc. But in reality a lot of this is blended across a spectrum of skills.”
Exercising Team Builds
Amid the data science team building exercise, Cloudera’s O’Connor and O’Reilly’s Lorica both emphasize the importance of having members on the team who communicate data science capabilities with the business on the one hand, and who vet or audit data science approaches on the other. Those jobs aren’t necessarily in the data scientists’ skills wheelhouse.
As a company’s data science capabilities mature, it may find it does need the services of an actual data scientist or statistical expert. In some cases, it can be critical to select the right machine learning algorithm, or to determine whether a deep learning approach will help. But for companies that are just starting up their data science initiatives, having such brainpower on staff could be a waste.
That a real data scientist isn’t actually needed to do data science comes as a great relief to companies, O’Connor says. “Across the board, with every one of the customers I’ve ever talked to, once we have this discussion and they realize they can break it down into separate roles… there’s this huge sigh of relief and they say, ‘Oh, ok, I know how to do this. I’m missing this one piece of the puzzle and I don’t need that mythical unicorn because I can’t find that person anyway.’”
Once the fixation on finding data scientists has ended, the company can get on with the process of building a data science team. “Not only is it cheaper and easier, it’s more effective,” O’Connor says. “In the end, you get a more comprehensive data science skills set inside your organization.”