Automating Development and Optimization of Machine Learning Models
You don’t need to be an expert in machine learning to know that it’s an exceptionally detail-oriented craft. Every step of the machine-learning development process—from preprocessing the data and engineering the feature model through building and evaluating the model—is intricate in its own right. As you connect these steps into an end-to-end workflow, the details and dependencies can grow unfathomably complex.
Designing and tuning a machine learning model is not for the faint of heart. If you’re a working data scientist, you must sort your way through a bewildering range of parameters in an attempt to get it right. For starters, you must select a feature model that contains the right set of independent variables to drive your intended machine-learning outcome. For another, you must sift through an unimaginably large set of possible machine-learning model architectures that incorporate your feature set. And there are myriad options for executing these models to achieve their intended outcomes, such as making predictions, performing classifications, or recognizing some image or other phenomenon of interest.
Given the finite nature of their time and resources, data scientists cannot possibly explore every possible modeling alternative relevant to their latest machine-learning project. That’s why we hear a growing drumbeat of demands for what’s now being referred to as “automated machine learning.” Essentially, this is an emerging practice in which data scientists use machine learning tools to accelerate the process of developing, evaluating, and refining machine learning models. These tools automatically sort through a huge range of alternatives relevant to some machine learning task. The tools help data scientists to assess the comparative impacts of these options on model performance and accuracy. In the process, the tools recommend the best alternatives so that data scientists can focus their efforts on those rather than waste their time exploring options that are unlikely to pan out.
Here is a quick list of the core machine-learning modeling tasks that are amenable to automation:
- Automated data visualization: This accelerates the front-end process of exploring data prior to modeling, such as by automatically plotting all variables against the target variable being predicted through machine learning, and also computing summary statistics.
- Automated data preprocessing: This accelerates the process of readying training data for modeling by automating how categorical variables are encoded, missing values are imputed, and so on.
- Automated feature engineering: Once the modeling process has begun, this accelerates the process of exploring alternative feature representations of the data, such as by testing which best fit the training data.
- Automated algorithm selection: Once the feature engineering has been performed, this accelerates the process of identifying, from diverse neural-net architectural options, the algorithms best suited to the data and the learning challenge at hand.
- Automated hyperparameter tuning: Once the algorithm has been selected, this accelerates the process of identifying optimal model hyperparameters, such as the number of hidden layers, the model’s learning rate (the size of the adjustments made to weights at each training iteration), its regularization (adjustments that help models avoid overfitting), and so on.
- Automated model benchmarking: Once the training data, feature representation, algorithm, and hyperparameters are set, this accelerates the process, prior to deployment, of generating, evaluating, and identifying trade-offs among alternate candidate models that conform with all of that.
- Automated model diagnostics: Once models have been deployed, this accelerates the process of generating the learning curves, partial dependence plots, and other metrics that show how rapidly, accurately, and efficiently they achieved their target outcomes, such as making predictions or classifications on live data.
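To make a few of these tasks concrete, here is a minimal sketch of automated preprocessing, algorithm selection, and hyperparameter tuning folded into one search. It assumes scikit-learn is installed and uses its bundled iris data set. The two candidate algorithms, the parameter values, and the `auto_select` helper are illustrative choices of mine, not a recommendation and not how any particular automation tool works internally.

```python
from sklearn.datasets import load_iris
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier


def auto_select(X, y):
    """Hypothetical helper: search preprocessing, algorithm, and hyperparameters."""
    pipeline = Pipeline([
        ("impute", SimpleImputer()),      # fill in missing values
        ("scale", StandardScaler()),      # put features on a common scale
        ("clf", LogisticRegression()),    # placeholder, replaced by the grid below
    ])
    param_grid = [
        {   # candidate 1: logistic regression, tuning regularization strength
            "impute__strategy": ["mean", "median"],
            "clf": [LogisticRegression(max_iter=1000)],
            "clf__C": [0.1, 1.0, 10.0],
        },
        {   # candidate 2: decision tree, tuning maximum depth
            "impute__strategy": ["mean", "median"],
            "clf": [DecisionTreeClassifier(random_state=0)],
            "clf__max_depth": [2, 4, None],
        },
    ]
    # Every combination is scored with 5-fold cross-validation.
    search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
    return search.fit(X, y)


X, y = load_iris(return_X_y=True)
search = auto_select(X, y)
print(search.best_params_)   # winning preprocessing, algorithm, and settings
print(search.best_score_)    # its mean cross-validated accuracy
```

An exhaustive grid like this only works for a handful of options. The tools discussed in this article exist precisely because realistic search spaces are far too large to enumerate.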
There aren’t many commercial offerings that automate these functions. This segment of the data science market is very immature. Nevertheless, some tools are starting to emerge from labs and the open-source community that handle all or some of these automation tasks.
Most notably, Google recently announced its own initiative, AutoML, which, the vendor claims, “will be able to design and optimize a machine learning pipeline faster than 99 percent of the humans out there.” Fundamentally, Google’s approach:
- Leverages several algorithmic approaches: AutoML relies on Bayesian, evolutionary, regression, meta-learning, transfer learning, combinatorial optimization, and reinforcement learning algorithms.
- Hubs modeling automation around a controller node: AutoML uses a “controller” neural net to propose an initial “child” neural-net architecture that it trains on a specific validation data set.
- Iteratively refines machine learning architectures: AutoML develops, trains, and refines multilayered machine-learning model architectures in repeated iterative rounds. It may take hundreds or thousands of rounds for the controller to learn which regions of the machine-learning architectural space achieve high vs. low accuracy with respect to the assigned learning task on the training set.
- Guides model refinement through iterative feedback loops: AutoML’s controller neural net acquires feedback from the performance of the child model, with respect to a particular learning task, for guidance in generating a new child neural net to test in the next round.
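The controller-and-child feedback loop described above can be illustrated with a toy sketch. This is not Google’s implementation: the "controller" here is just a table of preferences updated with a simple policy-gradient (gradient bandit) rule, the search space is two made-up choices (layer count and units per layer), and `score_child` is a stand-in formula I invented in place of actually training a child network.

```python
import math
import random

# Hypothetical search space: an "architecture" is a (layers, units) pair.
LAYER_CHOICES = [1, 2, 3, 4]
UNIT_CHOICES = [16, 32, 64, 128]


def score_child(layers, units):
    """Stand-in for training a child model and measuring validation accuracy.
    Made-up formula that peaks at 3 layers and 64 units."""
    return 1.0 - 0.1 * abs(layers - 3) - 0.002 * abs(units - 64)


def softmax(prefs):
    """Turn preferences into sampling probabilities."""
    peak = max(prefs)
    exps = [math.exp(p - peak) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def update(prefs, chosen, advantage, lr):
    """Shift probability toward the chosen option if it beat the baseline,
    and away from it otherwise."""
    probs = softmax(prefs)
    for k in range(len(prefs)):
        indicator = 1.0 if k == chosen else 0.0
        prefs[k] += lr * advantage * (indicator - probs[k])


def run_controller(rounds=500, lr=0.5, seed=0):
    rng = random.Random(seed)
    layer_prefs = [0.0] * len(LAYER_CHOICES)
    unit_prefs = [0.0] * len(UNIT_CHOICES)
    baseline = 0.0
    best_arch, best_score = None, -math.inf
    for t in range(1, rounds + 1):
        # The controller proposes a child architecture.
        i = rng.choices(range(len(LAYER_CHOICES)), weights=softmax(layer_prefs))[0]
        j = rng.choices(range(len(UNIT_CHOICES)), weights=softmax(unit_prefs))[0]
        # The child is "trained" and scored; the score is the feedback signal.
        score = score_child(LAYER_CHOICES[i], UNIT_CHOICES[j])
        baseline += (score - baseline) / t   # running mean of all scores so far
        advantage = score - baseline
        update(layer_prefs, i, advantage, lr)
        update(unit_prefs, j, advantage, lr)
        if score > best_score:
            best_arch, best_score = (LAYER_CHOICES[i], UNIT_CHOICES[j]), score
    return best_arch, best_score


best_arch, best_score = run_controller()
print(best_arch, best_score)
```

In a real system each call to the scoring step means training a neural net, which is why hundreds or thousands of rounds translate into a very large compute bill.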
In addition to Google’s initiative, other noteworthy automated machine learning tools include:
- TPOT: This Python tool uses genetic programming in conjunction with scikit-learn to automate hyperparameter selection, algorithm selection, and feature engineering. It produces ready-to-run, standalone Python code for the best-performing model, in the form of a scikit-learn pipeline. Additional documentation is available, as is an interview with TPOT’s lead developer.
- Auto-Sklearn: This tool, a drop-in replacement for a scikit-learn estimator, uses Bayesian optimization, meta-learning and ensemble construction to automate algorithm selection, hyperparameter tuning, and data preprocessing. Essentially, Auto-Sklearn builds a probabilistic model that captures the relationship between model settings and their performance, then uses the model’s Bayesian optimization features to select settings that trade off exploration of uncertain elements of candidate machine-learning model architectural space against exploitation of those parts of the space likely to perform well. For more info on it, check out this paper. And here’s an interview with the team that developed the tool.
- Machine-JS: This tool automates algorithm selection and feature engineering. It also advises the user as to whether their problem, given the amount of data available, is as yet solvable by machine learning.
- Auto-Weka: This tool uses Bayesian optimization to automate algorithm selection and hyperparameter tuning.
- Spearmint: This tool uses Bayesian optimization to automatically run experiments that iteratively adjust model hyperparameters to achieve the machine-learning model’s objectives in as few runs as possible.
- Sequential Model-based Algorithm Configuration (SMAC): This tool automates hyperparameter selection so that models scale more efficiently and effectively to higher-dimensional feature spaces.
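Several of these tools lean on Bayesian optimization, so a small sketch of the idea may help. It assumes NumPy is installed and tunes a single hypothetical setting scaled to the range 0 to 1. The surrogate is a basic Gaussian process with an upper-confidence-bound rule, which is one common textbook formulation and not necessarily what any tool listed here uses. The `validation_score` function is a made-up stand-in for training and evaluating a real model.

```python
import numpy as np


def rbf_kernel(a, b, length_scale=0.3):
    """Squared-exponential similarity between two 1-D arrays of settings."""
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * (diff / length_scale) ** 2)


def gp_posterior(x_obs, y_obs, x_query, noise=1e-4):
    """Posterior mean and standard deviation of a zero-mean Gaussian process."""
    k_oo = rbf_kernel(x_obs, x_obs) + noise * np.eye(len(x_obs))
    k_oq = rbf_kernel(x_obs, x_query)
    solved = np.linalg.solve(k_oo, k_oq)
    mean = solved.T @ y_obs
    var = 1.0 - np.sum(k_oq * solved, axis=0)   # prior variance is 1
    return mean, np.sqrt(np.clip(var, 0.0, None))


def validation_score(x):
    """Made-up stand-in for 'train with setting x, return validation score'."""
    return 1.0 - (x - 0.7) ** 2


def bayes_opt(n_iter=10, kappa=2.0):
    grid = np.linspace(0.0, 1.0, 201)      # candidate settings
    x_obs = np.array([0.0, 0.5, 1.0])      # small initial design
    y_obs = validation_score(x_obs)
    for _ in range(n_iter):
        mean, std = gp_posterior(x_obs, y_obs, grid)
        # High mean rewards exploitation; high std rewards exploration.
        ucb = mean + kappa * std
        x_next = grid[np.argmax(ucb)]
        x_obs = np.append(x_obs, x_next)
        y_obs = np.append(y_obs, validation_score(x_next))
    best = int(np.argmax(y_obs))
    return float(x_obs[best]), float(y_obs[best])


best_x, best_y = bayes_opt()
print(best_x, best_y)
```

The `kappa` parameter is the knob that trades exploration against exploitation: larger values push the search toward settings the surrogate is unsure about, and smaller values keep it near settings already known to score well.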
As this technology matures and gets commercialized, some fear that it may automate data scientists out of jobs. I think any such concerns are overblown. That’s because every one of the machine-learning pipeline automation scenarios these tools support—data visualization, preprocessing, feature engineering, algorithm selection, hyperparameter tuning, benchmarking, and diagnostics—requires a data scientist to set it up, monitor how it proceeds, and evaluate the results. In other words, expert human judgment will remain essential for ensuring that automation of machine learning development doesn’t run off the rails.
As I’ve stated elsewhere, manual quality assurance will always remain a core task for which human data scientists will be responsible, no matter how much their jobs get automated.
About the author: James Kobielus is SiliconANGLE Wikibon’s lead analyst for Data Science, Deep Learning, and Application Development.