Solving the ‘Last Mile’ Problem in Data Science
Data science is seeing a burst of innovation thanks to a blossoming of machine learning technologies and techniques. But much of that innovation isn’t getting into production because of an impedance mismatch between data science and IT. Now a company called Open Data Group is aiming to close that gap with a Docker-based model deployment framework.
Open Data Group’s CTO Stu Bailey describes the company’s FastScore framework as an abstraction layer that makes it easier for enterprise IT professionals to deploy data science models into production environments.
“We’re exclusively focused on bridging analytic professionals, data scientists, quants, model builders, analytic engineers – whatever you want to call them – and IT for deploying analytic models,” he says. “Our solution focuses on getting models deployed as durable, cloud portable assets that will have a very long lifetime, but can be easily changed, easily migrated.”
FastScore doesn’t care what language or environment the analytic model is developed in. It supports models developed in Python, R, SAS, and H2O.ai, as well as in Jupyter and Apache Zeppelin data science notebooks, among others. Once the model is created and logged with its Avro schema, FastScore can transform it into a Docker microservice that can be called through a REST API, an interface IT professionals are already familiar with.
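To make the schema-plus-REST idea concrete, here is a minimal sketch of what a schema-checked scoring endpoint might look like. It is not FastScore’s API: the schema, the field names, the toy model and its coefficients, and the `handle_request` function are all hypothetical, and the type check covers only a few Avro primitive types rather than the full Avro specification.

```python
import json
import math

# Hypothetical Avro-style record schema describing the model's input.
INPUT_SCHEMA = {
    "type": "record",
    "name": "SensorReading",
    "fields": [
        {"name": "temperature", "type": "double"},
        {"name": "vibration", "type": "double"},
    ],
}

# Python types accepted for the few Avro primitive types used in this sketch.
AVRO_PRIMITIVES = {
    "double": (int, float),
    "int": (int,),
    "string": (str,),
    "boolean": (bool,),
}


def conforms(record, schema):
    """Return True if record has exactly the schema's fields with matching types."""
    if not isinstance(record, dict):
        return False
    if set(record) != {field["name"] for field in schema["fields"]}:
        return False
    for field in schema["fields"]:
        value = record[field["name"]]
        allowed = AVRO_PRIMITIVES[field["type"]]
        # bool is a subclass of int in Python, so reject it for numeric fields.
        if isinstance(value, bool) and bool not in allowed:
            return False
        if not isinstance(value, allowed):
            return False
    return True


def score(record):
    """Toy logistic model; the coefficients are made up for illustration."""
    z = -4.0 + 0.03 * record["temperature"] + 1.5 * record["vibration"]
    return 1.0 / (1.0 + math.exp(-z))


def handle_request(body):
    """Mimic a REST scoring endpoint: JSON string in, (status, JSON string) out."""
    try:
        record = json.loads(body)
    except json.JSONDecodeError:
        return 400, json.dumps({"error": "body is not valid JSON"})
    if not conforms(record, INPUT_SCHEMA):
        return 422, json.dumps({"error": "record does not match input schema"})
    return 200, json.dumps({"failure_probability": score(record)})
```

The point of the design is that the caller only needs to know the schema and the endpoint. Whatever language or tool produced the model stays hidden behind the request handler, and malformed input is rejected before it reaches the model.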
“We’re exclusively focused on deployment,” Bailey says. “We have no agenda in how model building is done. We spend quite a bit of time integrating with model building tools, but our focus is building a very clear abstraction for handing off models from the data lab or the data science process into pre-production and production environments.”
The company is just as neutral when it comes to production environments as it is for data science development environments. Users can use whatever scheduling system they want, including Kubernetes, DC/OS, or Cloud Foundry, and FastScore supports data stores like S3, HDFS, and relational databases.
Creating an abstraction layer between data science and IT lets the data science team move as quickly as it wants, while sparing the systems administrators, network administrators, systems analysts, and ultimately the CIO from getting too involved with the day-to-day management of machine learning models.
Open Data Group recently worked with a large manufacturer that was struggling to get machine learning models into production. “They had 40 awesome models, but they hadn’t been deployed,” Bailey says. “Why not? Because of this impedance mismatch between the newness of machine learning, and the general risk profile of IT versus data science.”
In addition to bundling machine learning models as consumable Docker microservices, the company also tends to the daily care and feeding of the models, which it refers to as AnalyticsOps. It offers hooks into code repositories like GitHub, model management functionality, and A/B testing capabilities for comparing the effectiveness of models.
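As a rough illustration of what A/B testing two deployed models involves, the sketch below splits incoming records between two models and compares their accuracy. It is a generic example, not FastScore’s mechanism: `assign_variant`, `compare_models`, the hash-based split, and the choice of accuracy as the metric are all assumptions made for illustration.

```python
import hashlib


def assign_variant(record_id, share_b=0.5):
    """Deterministically route a record to 'A' or 'B' from a hash of its id."""
    digest = hashlib.sha256(record_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    return "B" if bucket < share_b else "A"


def compare_models(events, model_a, model_b, share_b=0.5):
    """Score each event with its assigned model and return accuracy per variant.

    events is an iterable of (record_id, features, actual_label) tuples.
    A variant that received no traffic reports None.
    """
    models = {"A": model_a, "B": model_b}
    stats = {"A": [0, 0], "B": [0, 0]}  # [correct, total]
    for record_id, features, actual in events:
        variant = assign_variant(record_id, share_b)
        predicted = models[variant](features)
        stats[variant][0] += int(predicted == actual)
        stats[variant][1] += 1
    return {
        variant: (correct / total if total else None)
        for variant, (correct, total) in stats.items()
    }
```

Hashing the record id, rather than choosing at random, means the same record always goes to the same model, which keeps the comparison reproducible across reruns.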
“We have a simple set of abstractions, a very consumable technology stack that really makes the data science much more productive, but it’s built like IT would expect it to be built, in a modular way that’s future proof for their own journey,” Bailey says. “If I want to move a prediction function for a GLM that I built in scikit-learn and deploy it to AWS and plumb it into a Kafka stream, that should be easy. And then if I want to move it to Google, that should be easy too. And I should have a very high degree of confidence that all the math is absolutely going to be the same.”
While it’s new to the big data software scene, Open Data Group has been in the machine learning and applied statistics business for 18 years, and has quite a bit of experience in helping customers get real benefits out of their data.
The Chicago-based company was founded by Robert Grossman, the well-respected inventor of the Predictive Model Markup Language (PMML) and a professor at the University of Chicago, and recently switched from a services-first model to developing software. Now it’s taking that experience and bundling it up in such a way as to enable customers to be creative with data science and use the latest advances, yet protect the IT department from being caught up in the explosion of innovation.
Bailey taught at the University of Chicago with Grossman before co-founding Infoblox with Pete Foley, who is Open Data Group’s CEO. He says the timing is right for companies to get more serious about data science and how they can use it to improve their business.
“I would have loved to have started a data science company when I started Infoblox but it was too early, in my estimation,” he says. “We’re seeing a transformation of large portions of the economy and industrial sector with data science in a somewhat analogous way to how computer science really started to impact very large portions of the economy in the 80s and 90s… It’s just the beginning of the journey.”