Databricks Offers Something for Everybody with AutoML Solution
Databricks today took the covers off a new automated machine learning solution that promises to reduce the amount of manual coding required to develop predictive applications. But while other AutoML solutions tend to focus on core data science aspects of predictive apps, like model selection and hyperparameter tuning, Databricks' new offering is designed to be used by a broad swath of personas to automate a range of data science and engineering activities, from data prep to production deployment.
Getting predictive applications from development into production requires the cooperation of several types of users, all of whom have a role within Databricks' new AutoML solution, according to Bharath Gowda, vice president of product marketing for the San Francisco-based company.
“What we see in the marketplace is there are lots of different kinds of people with different skills that are actually needed to take data and to turn it into predictive models in production,” Gowda says. “We see data engineers playing a critical role in the data prep piece of it. There are ML engineers who are a lot more sophisticated on the other end of the spectrum, and software engineers who also know ML who are deeply involved with productionalizing the systems.
“But there’s also the emerging class of the citizen data scientist, if you will,” he continues. “These are people who either know the domain really well or they come from a software background and [are] learning the business piece of it. It’s usually all these people who are super involved in the end-to-end lifecycle to really create value for their customers.”
The Databricks AutoML solution presents different interfaces and capabilities to different users depending on their comfort level with the technologies and the degree to which they want to get their hands dirty with code or low-level concepts, Gowda says.
First there are the folks who want entirely codeless environments, who want to see GUIs and have things automated to the greatest extent possible. Next there are those who want the AutoML solution to automate most of the solution, but allow the users to go in and tweak the code a little bit. Lastly there are the expert ML engineers who want access to every knob imaginable so they have full control over every aspect of the application.
“So we give all these different people a common workspace where they can play at any different [level],” he says. “You can have [a] citizen data scientist at the highest level, not a coding environment at all, but they can go in and choose the features they want to work with, do the model search and all that, and very quickly know if they’re in the ballpark. And then from there, they can hand the same environment over to the data scientists and ML engineers, and they can go within it and begin to fine tune the hyperparameters.”
The new AutoML capabilities were added to the Unified Analytics Platform, which is the name of Databricks’ cloud-based collection of data science and engineering tools developed around Apache Spark. There is no additional SKU added to the Databricks product catalog. Every customer who subscribes to the hosted data science and engineering solution gets AutoML as part of the bundle.
The new pieces that enable the AutoML functionality exist in several areas of the offering. That includes MLflow, the open source project that Databricks unveiled a year ago to help track the development and lineage of machine learning models. It also uses Delta, the new big data staging zone that Databricks unveiled nearly two years ago to help address the challenging data quality problems plaguing customers adopting the new generation of big data platforms.
Much of the new capability that Databricks built for AutoML is packaged up in a product called Machine Learning Runtime, Gowda says. “But really the end-to-end offering is a union of all the tools that exist around the [platform], from Delta to MLflow to the ML Runtime.”
In addition to the ML Runtime, there are several other components to the AutoML launch. One of those is AutoML Toolkit, an offering available from Databricks Labs that automates the entire machine learning pipeline, from feature engineering to deployment. AutoML Toolkit applications are automatically tracked in MLflow.
There’s also the Automated Model Search, which gives the customer a way to find the best model for a particular piece of data. If the user already knows what model they want to use, they can use Automated Hyperparameter Tuning, which turns the weights and knobs in the model to get the best performance out of it.
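The announcement doesn’t show the actual API for Model Search or the hyperparameter tuner, but the underlying idea, trying many candidate settings and keeping the best-scoring one, can be illustrated with a minimal, self-contained sketch. The dataset, the toy “model,” and the parameter grid below are all invented for illustration and are not Databricks code:

```python
from itertools import product

# Toy dataset: (value, binary label) pairs -- illustrative only.
data = [(0.5, 0), (1.5, 0), (2.5, 1), (3.5, 1), (4.5, 1), (0.8, 0)]

def accuracy(threshold, scale):
    """Score a trivial 'model' that predicts 1 whenever scale * x > threshold."""
    correct = sum(1 for x, label in data if int(scale * x > threshold) == label)
    return correct / len(data)

# Exhaustive search over a small hyperparameter grid -- the same idea an
# automated tuner applies at far larger scale, with smarter search strategies.
grid = {"threshold": [1.0, 2.0, 3.0], "scale": [0.5, 1.0, 2.0]}
best_score, best_params = -1.0, None
for threshold, scale in product(grid["threshold"], grid["scale"]):
    score = accuracy(threshold, scale)
    if score > best_score:
        best_score, best_params = score, {"threshold": threshold, "scale": scale}

print(best_params, best_score)
```

A real AutoML system replaces the toy scoring function with model training and validation, and typically logs every trial’s parameters and metrics (the role MLflow plays in Databricks’ offering) rather than keeping only the winner.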
Clemens Mewald, Databricks’ director of product management for machine learning and data science, compares the different levels of automation available in the AutoML offering to three types of car drivers.
“At the bottom, at the most expert level, are [research] scientists or machine learning engineers, who have full knowledge of all the [options] that exist and always want the latest and the greatest. They want a stick shift. It may not be the most convenient and most efficient, but they want that full control,” he explains to Datanami.
“Others are data engineers and software engineers who are not super familiar with ML and various modeling techniques,” he continues. “They want to go up from a stick shift to an automatic. They want the most common tasks taken care of for them. They can still write code, but Databricks automates a lot of the work for them.”
“At the top level is the driverless car,” Mewald continues. “You want to give up all the controls and have an automated system, have an outcome….In some cases, in some environments, they work beautifully well. But in many cases, you actually have to intervene and take over some control. At that level, the product that’s closest to this is the AutoML Toolkit, that’s really automating everything for you. But if you want to regain control and go a whole lot deeper, you can.”
All of the software behind Databricks’ AutoML solution is open source, giving customers freedom to deploy it how they like, the company says.