February 13, 2017

A Platform Approach to Data Science Operationalization

Alex Woodie

(Boiko Y/Shutterstock)

Eduardo Ariño de la Rubia has been building quantitative teams for the past two decades, and knows first-hand how hard it can be to bring data science capabilities into a production setting. He understands there are multiple pitfalls waiting to trip up data scientists, from corrupted models and code deployment snafus to collaborating as a team and proving to the CFO that stuff actually works.

So when an advertisement for Domino Data Lab popped up on his Twitter feed, asking him if he wanted to run his models faster, it grabbed Ariño de la Rubia’s attention. After trying out the cloud-based data science platform for a few minutes, the then-Principal Data Scientist for the Ingram Content Group realized what he was looking at.

“I turned to my wife and said, ‘I’m so happy somebody finally built this thing,’” he tells Datanami. “I had built bad versions of it, because I had needed to.”

Ariño de la Rubia quickly became a Domino Data Lab customer. He used the company’s data science platform to give his team of data scientists access to data, streamline their access to quantitative tools, track the development of their models, and generally punch above their collective weight. In fact, he liked the product so much that he decided to join the company, and today he advocates for the product as its Chief Data Scientist.

While Ariño de la Rubia is partial to Domino now, he realizes that the emerging class of products known as data science platforms is addressing a giant need in the market: to serve as the connective tissue for the myriad tools data scientists want to use, and to provide a general structure to the workflow.

Quantitative Roots

Domino was founded in San Francisco by two former hedge fund quants, CEO Nick Elprin and CTO Chris Yang, who realized there was a fertile gap emerging in the data science space for someone who could bring all the tools and techniques together, while respecting the data scientists’ unique talents and point of view.

“What they learned,” Ariño de la Rubia says, “was there are a couple of truths. Collaboration is hard. Building a shared context is hard. Top down mandates to quantitative researchers just don’t work. You can’t enforce a new tool or programming language on high powered folks who have Ph.D.s in condensed matter physics. You can’t suddenly just tell them, ‘Well you can’t use Matlab anymore because we bought this random doohickey.’”

The rapid proliferation of advanced analytic and statistical libraries in the open source domain is giving people alternatives to the MathWorks, SASs, and IBM SPSSs of the world. No longer confined to a proprietary vendor’s analytic platform, data scientists are free to experiment, share, and use what works best for them. But as powerful as this flowering of open data science is, the freedom comes at a price.

“They don’t come together and play well nicely,” Ariño de la Rubia says. “The Jupyter notebooks and R Studios of the world or these advanced libraries like PyMC3… They’re incredible tools, but as an organization it’s really hard to leverage these incredible tools.”

Reproduce Your Work

The key aspect of Domino’s platform is the reproducibility engine, which should resonate with people with hard science backgrounds. “It encapsulates every experiment you run,” Ariño de la Rubia says. “So if, a year down the road, your original data scientist has left and a new data scientist needs to update a model, they don’t have to go through some bizarre process where they figure out what incantation of libraries and dependencies needs to be cobbled together to get an experiment to run.”

Domino runs Docker under the covers as an abstraction layer and a buffer between the data scientists and the models. “You just click a button and the Docker container spins up and installs all the appropriate dependencies,” Ariño de la Rubia says. “From the ground up, you can re-execute the analysis and actually build on the previous work instead of what usually happens in an organization, which is just spinning your wheels and rewriting the same recommender over and over again.”
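To make the idea of encapsulating an experiment concrete, here is a minimal Python sketch of the kind of record a reproducibility layer might keep. It is not Domino's implementation, which the article does not describe beyond its use of Docker. The function names (`file_sha256`, `capture_manifest`, `save_manifest`) and the manifest fields are assumptions made for illustration.

```python
import hashlib
import json
import platform
from importlib import metadata


def file_sha256(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def capture_manifest(code_path, data_path):
    """Record what a later re-run would need (hypothetical format).

    Captures the interpreter version, every installed package pinned to
    its exact version, and content hashes of the code and input data.
    """
    packages = sorted(
        {
            f"{dist.metadata.get('Name')}=={dist.version}"
            for dist in metadata.distributions()
            if dist.metadata.get("Name")
        }
    )
    return {
        "python": platform.python_version(),
        "packages": packages,
        "code_sha256": file_sha256(code_path),
        "data_sha256": file_sha256(data_path),
    }


def save_manifest(manifest, path):
    """Write the manifest as stable, human-readable JSON."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(manifest, handle, indent=2, sort_keys=True)
```

A container build could install exactly the pinned packages listed in such a manifest, and the two hashes would let a newcomer confirm they are re-running the same code against the same data.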

While much of the cutting-edge work today is being done with R and Python libraries, or machine learning algorithms in Apache Spark or H2O, the company is agnostic to the actual tools being used. Some customers use Domino to track work in Matlab, while others employ older languages and coding environments.

“I talked to a gentleman earlier this week whose coworker is writing Fortran code and he’s using TensorFlow to invent a new type of clustering in the same exact reproducible environment inside of Domino,” Ariño de la Rubia says. “That’s really powerful, to bring together these different numerical and statistical computing practices all under one umbrella, without restricting these practitioners from being able to use the tools that they’re excited about using.”

This approach of combining code reusability and tool flexibility has resonated with the market, and Domino has a number of blue-chip names, like Allstate, Clorox, Monsanto, and Zurich, on its customer roster. The software was originally designed to run in the cloud, but was ported to run on premises at the request of Domino’s very first customer.

Anything that runs on Linux or Unix can be brought under the Domino umbrella. So if a customer wants to develop a statistical model in Python, and then execute it atop a Hadoop cluster, that can work. However, many of Domino’s customers are running their machine learning scoring workloads on the same Domino cluster that they’re training the models on. For these clients, the capability to expose models via a simple API helps to simplify the data science experience.
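As an illustration of what exposing a model via a simple API can look like, the following Python sketch wraps a scoring function in an HTTP endpoint using only the standard library. It is a generic example, not Domino's API: the `WEIGHTS` and `BIAS` values stand in for a trained model, and the handler name, port, and JSON shape are all assumptions.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical coefficients standing in for a trained model.
WEIGHTS = {"age": 0.5, "income": 0.25}
BIAS = 1.0


def score(features):
    """Linear score: bias plus the weighted sum of the known features."""
    return BIAS + sum(
        WEIGHTS[name] * float(features.get(name, 0.0)) for name in WEIGHTS
    )


class ScoringHandler(BaseHTTPRequestHandler):
    """Accepts a JSON object of features via POST and returns a score."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        try:
            features = json.loads(self.rfile.read(length) or b"{}")
            payload = {"score": score(features)}
            status = 200
        except (ValueError, TypeError, AttributeError):
            payload = {"error": "expected a JSON object of numeric features"}
            status = 400
        body = json.dumps(payload).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Serve scoring requests on localhost until interrupted."""
    HTTPServer(("127.0.0.1", port), ScoringHandler).serve_forever()
```

Calling `serve()` starts the endpoint. A client would then POST a body such as `{"age": 2, "income": 4}` and receive the score as JSON, without needing to know which language or library produced the model.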

Tracking Change in Data Science

Keeping up with change is one of the hardest parts of becoming a data-driven organization. Models must be retrained to account for new data. Old assumptions must be revisited to check for accuracy. And data scientists themselves constantly churn in and out of organizations in search of richer data and greener pastures.

Retaining a semblance of normalcy amid all this change is one of the tasks that Domino (and other data science platforms like it) has taken on. This is particularly important when you realize that the Wild Wild West of big data analytics will eventually end, and in fact is already ending for companies that have European customers and must comply with the GDPR.

“What we discovered is, as data science becomes central to businesses, the team gets bigger,” Ariño de la Rubia says. “You have regulators, model validators, people who want to be able to look into the entire process, and to actually sign off and understand and provide subject matter expertise.”

As more decisions are made in a “fully algorithmic fashion,” companies will need tools to prove that the models didn’t discriminate against people, he says. Companies will eventually need “to trace that whole model provenance back to the original data set, the entire experimentation process to deployment, so that either regulators or even your customers can understand, why was this decision made?”

“Our argument is straightforward and I think it resonates,” Ariño de la Rubia says. “If analytics is core to your business, then you need a system of record for your analytical workload.”
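The provenance tracing described above, from a deployed model back through experimentation to the original data set, can be sketched as a chain of linked records. The Python example below is one possible shape for such a system of record, not a description of how Domino stores lineage. The `ProvenanceLog` class, its record fields, and the sample file and endpoint names are all hypothetical.

```python
import hashlib
import json


def record_id(record):
    """Derive a record's ID from a hash of its canonical JSON form."""
    canonical = json.dumps(record, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


class ProvenanceLog:
    """In-memory log where each record points at the record it came from."""

    def __init__(self):
        self._records = {}

    def add(self, kind, details, parent=None):
        """Store a record and return its ID; the parent must already exist."""
        if parent is not None and parent not in self._records:
            raise KeyError(f"unknown parent record: {parent}")
        record = {"kind": kind, "details": details, "parent": parent}
        rid = record_id(record)
        self._records[rid] = record
        return rid

    def trace(self, rid):
        """Walk parent links from a record back to its root, newest first."""
        chain = []
        while rid is not None:
            record = self._records[rid]
            chain.append(record)
            rid = record["parent"]
        return chain


# Usage with made-up names: data set -> experiment -> deployment.
log = ProvenanceLog()
dataset = log.add("dataset", {"name": "claims_2016.csv"})
experiment = log.add("experiment", {"algorithm": "logistic regression"}, parent=dataset)
deployment = log.add("deployment", {"endpoint": "/score"}, parent=experiment)
print([record["kind"] for record in log.trace(deployment)])
# → ['deployment', 'experiment', 'dataset']
```

Because each ID is a hash that covers the parent's ID, altering an earlier record changes every ID downstream of it, which is one way a reviewer could check that the recorded history has not been edited after the fact.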

There’s no doubt that many organizations, perhaps most of them, want to become “data driven” organizations. But closing the gap between this desire and successfully leveraging modern data science is no easy task. For organizations on the outside of the data bubble looking in, data science platforms like Domino’s present a compelling path to ease the burden of data science operationalization.
