Machine Learning Tool Seeks to Automate Data Science
MIT researchers will report details this week on a “data science machine” billed as being able to automatically derive predictive models from raw data using a “Deep Feature Synthesis” algorithm.
The algorithm is said to automatically synthesize features for machine learning. The researchers from MIT’s Computer Science and Artificial Intelligence Laboratory said they demonstrated the “expressiveness” of the generated features on three datasets from different domains. They plan to expand those demonstrations to more datasets.
Using an auto-tuning process, “we optimize the whole pathway without human involvement, enabling it to generalize to different datasets,” said researchers James Max Kanter and Kalyan Veeramachaneni in a paper to be presented at an international data science and analytics conference.
The auto-tuned machine-learning pathway is intended to extract value from the synthesized features, which start from such variables as age or gender. The resulting new features can be used to create new measurements, such as the percentage of a certain feature, they explained.
The algorithm follows relationships in the datasets to a base field, and then sequentially applies mathematical functions along that path to create a final feature. A machine learning pathway is then implemented and tuned using a technique from probability theory called the Gaussian copula.
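The path-following idea described above can be illustrated with a minimal Python sketch. It is not the researchers' implementation: the toy tables (customers, orders, items), the column names, the helper functions and the choice of aggregation functions are all assumptions made for illustration. The sketch walks from a base field (an item's price) up through two relationships, applying aggregation functions at each step so that the features stack.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy data: customers -> orders -> order items.
customers = [{"id": 1}, {"id": 2}]
orders = [
    {"id": 10, "customer_id": 1},
    {"id": 11, "customer_id": 1},
    {"id": 12, "customer_id": 2},
]
items = [
    {"order_id": 10, "price": 5.0},
    {"order_id": 10, "price": 15.0},
    {"order_id": 11, "price": 30.0},
    {"order_id": 12, "price": 8.0},
]

# Illustrative set of mathematical functions applied along a relationship.
AGGREGATIONS = {"sum": sum, "mean": mean, "max": max, "count": len}


def aggregate(children, foreign_key, field, func):
    """Group child rows by their foreign key and apply func to each group's values."""
    groups = defaultdict(list)
    for row in children:
        groups[row[foreign_key]].append(row[field])
    return {key: func(values) for key, values in groups.items()}


def synthesize(parents, children, foreign_key, field, child_name):
    """Add one new feature per aggregation to every parent row (modified in place)."""
    for name, func in AGGREGATIONS.items():
        values = aggregate(children, foreign_key, field, func)
        feature = f"{name}({child_name}.{field})"
        for parent in parents:
            # Parents with no child rows get None for the feature.
            parent[feature] = values.get(parent["id"])
    return parents


# Step 1: follow items -> orders and aggregate the base field "price".
synthesize(orders, items, "order_id", "price", "items")

# Step 2: follow orders -> customers and aggregate a feature created in step 1.
synthesize(customers, orders, "customer_id", "sum(items.price)", "orders")

for customer in customers:
    print(customer["id"], customer["mean(orders.sum(items.price))"])
```

Each customer row ends up with features such as the mean of its order totals, a value that appears in none of the raw tables. The tuning step based on the Gaussian copula is deliberately left out of this sketch.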
The researchers said they entered their Data Science Machine in three data science competitions against more than 900 other data science teams, besting more than 600 of them. “In two of the three competitions we beat a majority of competitors, and in the third, we achieved 94 percent of the best competitor’s score,” the MIT data scientists said.
They asserted that the Data Science Machine has a role alongside humans. “Currently, data scientists are very involved in the feature generation and selection processes. Our results show that the Data Science Machine can automatically create features of value and figure out how to use those features in creating a model,” they added. While humans beat the machine for all datasets, the researchers argued that the machine’s success-to-effort ratio suggests “there is a place for it in data science.”
The Deep Feature Synthesis algorithm has its own set of parameters that affect the resulting synthesized features. Future work aimed at “empowering” data scientists could focus on selecting these parameters to improve overall system performance.
“Right now, the system’s approach doesn’t involve much human interaction,” the researchers explained. “In the future, the Data Science Machine could expose ways for humans to guide and interact with the system, enabling the pairing of human and machine intelligence.”
Expanding testing beyond three different datasets would help make the Data Science Machine a more useful tool for data science, they concluded.
The Data Science Machine and accompanying Deep Feature Synthesis algorithm were built on top of a MySQL database using the InnoDB storage engine for tables. The researchers said raw datasets were manually converted to a MySQL schema for processing by the Data Science Machine.
The Python programming language was used to implement the logic for calculating, managing and manipulating the synthesized features.