August 27, 2018

Five Ways to Unshackle Data Science from IT Now


If your company is struggling to get value out of its data science initiative, it’s not alone. More often than not, an organization will experience growing pains as it figures out how to gain data science competency through trial and error. But as Databricks CEO and co-founder Ali Ghodsi tells us, there are several ways companies can alleviate the pain and accelerate the data science transition, particularly as it relates to the IT department.

Here are five suggestions from Ghodsi for helping companies unshackle their data science teams from the IT department and begin delivering business value.

1. Engage with IT

Problem number one is that most companies are not organized properly to be effective with data science, particularly when a company has a strong IT culture. In many shops, data science and IT are at odds, with different mandates and different resources.

“The first problem we’re bumping into is organizations are just not built for this kind of machine learning,” Ghodsi says. “Basically IT owns the data. IT is responsible for making sure your data warehouse or Hadoop data lake is secure, reliable and cost efficient.”

Engaging your data science project with IT is a critical first step (REDPIXEL.PL/Shutterstock)

Data is what fuels machine learning and modern statistical approaches to artificial intelligence, and most big companies have tons of useful data that’s been collected over decades. But all too often, machine learning is hindered because the IT department is hesitant to give up control of the data.

“What often happens is, data engineering will say ‘Why do you need access to this data set? Fill out this form, we’ll look through it, we’ll get to it when we get to it,'” Ghodsi says. “Meanwhile, you need to iterate on the data set, because even once you get access to the data after waiting, you might figure out that it’s missing some data, you want to enrich it, so you’re back to square one.”

That basic tension is the source of inter-office drama that plays out every day in American companies that want to move ahead with data science initiatives. “These companies are just not organized for this,” Ghodsi says. “The potential is massive for AI, but they’re just organized the wrong way. They’re not organized so you can iterate on these data sets and get great results.”

2. Hire a CDO

The tension between data science and IT is evident in almost every organization that Ghodsi has worked with. The problem has to do with the organizational structure of existing corporations, and how they tend to compartmentalize responsibilities.

“You show me an organization, and I will tell you if they have the problem,” Ghodsi says. “All I need to do is look at the org chart and the titles and will tell you if they have this problem or not. I don’t need more than that, and it’s almost always spot on.”

Having a CDO can help push through headwinds slowing data science (Monkey Business Images/Shutterstock)

One of the first steps that companies can take to end the bureaucratic logjam, Ghodsi says, is to create a chief data officer position. Companies that have hired CDOs don’t suffer the bureaucratic slowdowns that companies without CDOs do, he says.

“We often advise you get a CDO that can oversee all of this,” he says. “Lots of organizations need a CDO that can centralize it and reduce the friction.”

Creating a CDO position with reach into both IT and the data science initiatives inside the lines of business can also help forestall other moves that companies often make to break the logjam, such as letting lines of business (LOBs) set up their own data science projects in the cloud, outside of IT’s purview, or having IT take charge of data science.

“We see this sort of tension between the two to build up data science end to end from A to Z,” Ghodsi says.

3. Unified Platform

The tools and technologies used to conduct data science also affect how data science teams interact with IT. Depending on how an organization approaches the choice, it can be either a help or a hindrance.

Ghodsi recalls one Databricks customer that did all of its data science prototyping in Python, only to have the IT team re-implement the models in C++ for production. It’s a fairly common occurrence, but it can lead to problems. “When they did, the results they would get were slightly different from the data science team predictions,” he says. “Even though they tried as well as they could, they ended up with slightly different results.”
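Ghodsi does not say what caused the mismatch. One plausible source of small discrepancies when a model is ported between languages is floating-point arithmetic: the same formula can give slightly different answers if the operations are merely carried out in a different order. The following is a minimal Python sketch of that effect, assuming a toy linear model. The function names and numbers are hypothetical and are not taken from the customer's code.

```python
import math

# Hypothetical toy model: the prediction is a weighted sum of features.
weights = [1.0, 1.0, 1.0]
features = [0.1, 0.2, 0.3]


def predict_forward(w, x):
    """Accumulate the weighted sum left to right, as a prototype might."""
    total = 0.0
    for wi, xi in zip(w, x):
        total += wi * xi
    return total


def predict_reversed(w, x):
    """Same math, accumulated right to left, as a rewrite might happen to do."""
    total = 0.0
    for wi, xi in zip(reversed(w), reversed(x)):
        total += wi * xi
    return total


prototype = predict_forward(weights, features)
production = predict_reversed(weights, features)

print(prototype)                              # 0.6000000000000001
print(production)                             # 0.6
print(prototype == production)                # False
print(math.isclose(prototype, production))    # True
```

The two results agree to within rounding error but are not bit-for-bit identical. Spread across many operations, and combined with differences in libraries or numeric precision, gaps of this kind are one reason a re-implementation may not exactly reproduce the original model's predictions.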

A unified platform for data science and engineering can help eliminate differences (James Jones Jr./Shutterstock)

Data science teams favor software that gives them the flexibility to iterate and try new approaches. Perhaps one model is best implemented in TensorFlow, while another problem really calls for Scikit-learn. All that trial and error and sampling of different modeling approaches is core to data science, but it runs counter to the typical IT mandate, which favors sticking with one system for years, preferably under an enterprise agreement.

Ghodsi says these challenges can be countered by using an analytics platform that serves the needs of the data science and IT teams in one place. Such a platform can enable data scientists to work in the statistically oriented languages they’re familiar with, such as Python, R, and SAS, while also allowing the engineers to work within their favored languages, like Java, Scala, SQL, and C++.

“The goal is unified analytics,” Ghodsi says. “Because if they can’t collaborate, then data engineering and data science will be partitioned and siloed. And if data engineering is siloed from data science — that is, machine learning — then you can’t actually have any AI going on in your organization. That’s the problem we’re trying to tackle.”

Obviously, Databricks sells such a product; in fact, it’s called the Databricks Unified Analytics Platform. But the lesson can be applied to other products too.

4. Build a COE

Another way to defuse the tension between IT and data science teams is to build a center of excellence (COE) or a matrix organization with more than one reporting line. Either can help accelerate collaboration between IT and data science.

COEs can help bring data science and IT together, says Databricks CEO Ali Ghodsi

“The [unified] platform will not be enough by itself,” Ghodsi says. “We often advise you get a CDO that can oversee all of this. And if you don’t want to reorganize your 30-year-old organization that hasn’t been built for machine learning, then our advice is maybe come up with a center of excellence or a matrix organization, some place where people can start talking. If you’re going to have it siloed away in different parts of the organization, then data is just going to get stuck.”

There are lots of vendors and services firms that can help companies set up a data science COE and get started on a first data science project. Matrix organizations may be a little rarer when it comes to data science projects, but they are another approach that could work for you.

5. Third-Party Data

If worst comes to worst and you can’t get access to your company’s own internal data, you can always set up shop with external, third-party data. “It’s super paradoxical but sometimes it’s easier actually to buy a dataset from the outside than to get access to it internally in your organization,” Ghodsi says.

Data you buy from brokers is going to be anonymized and scrubbed of any personally identifiable information, so it won’t have all the useful details that data from your own organization will have, but that doesn’t mean it’s not useful.

In fact, many small startups that are processing massive amounts of data are doing so because they’ve either purchased third-party data sets or they’ve figured out how to scrape the Web for useful information, Ghodsi says.

Don’t overlook the potential that third-party data can bring (cybrain/Shutterstock)

“They’re extremely data driven and they just have one sole purpose, which is do machine learning to get better insight than the big old enterprises,” he says. “You can mine a lot of stuff from the Web. If you know how to get information out of it and structure it, there’s a lot you can do.”

There are no simple answers for getting data science to work better with the IT department. The two groups have different mandates and resources from their corporate masters, and are unlikely to budge much if confronted directly. But by taking a wider view of the challenges and possible solutions, companies can chart a path forward for their data science projects.

“There are going to be a lot of exciting changes” because of AI, Ghodsi says. “It’s super exciting. Just unfortunately right now, we have to go through some hard times with the way large enterprises are structured.”
