Brainome Right-Sizes Your Data Before ML Training
A startup called Brainome today launched a new product designed to help data scientists determine how much data they need to sufficiently train their machine learning models. In addition to cutting costs, the software can also help data scientists avoid overfitting their models.
Called Daimensions, the Python-based tool essentially works as a compiler that generates the “memory equivalent capacity” of one’s data. Based on this figure, data scientists can determine whether there’s enough data to extract a meaningful signal. The tool also reports on the capability to generalize from the data and helps with feature selection.
“We’re doing measurements on the data set in the context of the model you want to build,” says Bertrand Irissou, Co-Founder and CEO at Brainome. “We’re going to tell you, do you have enough data? Or do you have too much data? That’s an actual possibility.”
For example, given the progression of numbers “2, 4, 6, 8, 10,” it’s pretty obvious what the pattern is, and what the next number will be, Irissou says. There is no need to take the progression out into the millions, because it will only add costs for gathering, storing, and processing data.
“If I tell you ‘6, 4, 9, 8,’ the next number is not so obvious, and probably shouldn’t be, because they are the last four digits of my phone number and they’re supposed to be random,” Irissou tells Datanami. “So being able to detect from get-go whether or not there is information in the data is actually very important.”
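Irissou's two sequences can be turned into a toy check. The sketch below is only an illustration of the idea that some data carries a learnable pattern and some does not; it is not Brainome's method, and the function names `constant_step` and `predict_next` are hypothetical. It tests for just one kind of pattern, a constant step between consecutive numbers.

```python
from typing import Optional, Sequence


def constant_step(seq: Sequence[int]) -> Optional[int]:
    """Return the common difference if seq is an arithmetic progression, else None."""
    if len(seq) < 2:
        return None
    # Collect the distinct differences between neighbouring terms.
    diffs = {b - a for a, b in zip(seq, seq[1:])}
    return diffs.pop() if len(diffs) == 1 else None


def predict_next(seq: Sequence[int]) -> Optional[int]:
    """Predict the next term when a constant step exists; otherwise give up."""
    step = constant_step(seq)
    return None if step is None else seq[-1] + step


print(predict_next([2, 4, 6, 8, 10]))  # constant step of 2, so a prediction is possible
print(predict_next([6, 4, 9, 8]))      # no constant step, so no prediction is made
```

Five terms are enough to settle the first case, and adding millions more would not change the answer. For the second sequence, this simple test finds nothing to extrapolate from.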
Daimensions is based on a realization by Brainome co-founder and CTO Gerald Friedland, who is a data scientist and a professor at UC Berkeley. Friedland was working on a data science project with physicists, who became concerned about what happened to their carefully collected data when Friedland loaded them as parameters into a black box machine learning model.
“They said, you lose all the significance [of the data] that we created over hundreds of years,” Friedland says. “They were right.”
Instead of the brute-force approach that is typical of deep learning today, Friedland sought to identify significant patterns in the data before building the model and, based on those patterns, to take a more targeted approach to building a more meaningful (and smaller) machine learning model.
Inspired by David MacKay’s 2003 book, “Information Theory, Inference, and Learning Algorithms,” Friedland started building a framework that implemented this approach, which he documented in his 2018 paper “A Practical Approach to Sizing Neural Networks.” He teamed up with Irissou to launch San Francisco-based Brainome, which is now coming to market with its first product.
In addition to cutting costs related to cloud data storage and GPU runtime, Brainome hopes its software can help save another precious commodity in short supply in the data science world: time.
For example, say you wanted to build a bridge. A good civil engineer would approach that challenge by first methodically measuring the most important parameters for that bridge, including its length, width, wind shear, etc., Irissou says.
“Then you figure out the correct structure you’re going to use for that particular bridge, how much material you need, how much time you’re going to need,” he says. “By the time you drive your first truck over it, you know your bridge isn’t going to collapse.
“Machine learning today is kind of like building 100 bridges at a time and figuring out which one is going to collapse,” he continues. “What we’re doing in a fundamentally different way is taking the guesswork out of figuring out what is your need for data quantity, what is the feature engineering process that you’re going to use, and really empower the data scientist to reduce their experimentation cycles from weeks to just literally a few hours.”
Brainome is already working with several firms, including Cedars-Sinai Medical Center. The Los Angeles, California, hospital is using the company’s software in a study of genes’ impact on cancer. The data at play is massive, with over 20,000 genes in the mix. Cedars-Sinai uses Brainome at the front end to figure out which of the 20,000 genes are predictive of cancer.
In Cedars-Sinai’s case, trying to figure out which features are important after the model has already been trained is a losing battle, Irissou says. “If you use a brute-force approach, it would be two to the power of N, N being the number of features,” he says. “Two to the power of 20,000 is just an astronomical number that would require an infinite amount of power to calculate.”
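To put the "two to the power of N" figure in perspective, here is a small arithmetic sketch. It assumes that brute force means evaluating every possible subset of N features, where each feature is either in or out. The helper name `subset_count_digits` is hypothetical. It uses logarithms to count the decimal digits of 2 to the power of N rather than printing the number itself.

```python
import math


def subset_count_digits(n_features: int) -> int:
    """Number of decimal digits in 2**n_features, the count of feature subsets."""
    # digits(x) = floor(log10(x)) + 1, and log10(2**n) = n * log10(2)
    return math.floor(n_features * math.log10(2)) + 1


print(2 ** 10)                      # 1024 subsets for just 10 features
print(subset_count_digits(10))      # 4 digits
print(subset_count_digits(20_000))  # 6021 digits for 20,000 features
```

With 20,000 candidate genes, the number of subsets runs to more than 6,000 decimal digits, which is why exhaustive search is off the table and features have to be ranked some other way.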
Another early adopter of Brainome is SK telecom, the South Korean telecommunications company. Eric Davis, vice president of SK telecom’s AI Language Tech Labs, says Brainome “has been a breath of fresh air” in helping the company tackle a problem in the healthcare domain.
“Our previous approach was time consuming, full of guesswork, and took over a week to iterate from feature extraction to experimentation to results,” Davis says in a press release published today. “Brainome took a lot of the guesswork out of data quantity needs and feature importance, allowing us to reduce our experimentation cycle from a week to mere hours. Equally as important, the easy-to-deploy Python model allowed us to spend more time on experimentation versus serving and deploying our model.”
By figuring out the importance of one’s data before building the machine learning model, the resulting size of the model can be shrunk significantly without giving up accuracy, Irissou says. It also helps to keep the model from memorizing the data, to generalize better, and to avoid overfitting.
“The point of measuring is you can figure out the complexity beforehand. Most models are grossly oversized. They may generalize, but the assumptions we [have is] you want models that are as small as possible,” Irissou says. “We do it to sell software. We’re not here to sell GPU hours in cloud. There’s very little incentive in the current ecosystem to say, oh let me give you a solution that gets you 1/100th of the amount of GPU that you’re currently using and it’s going to give you models so small that you can run them on regular CPU.”
Brainome’s software is sold in a Docker container that can run in the cloud or on-premises. The software works with structured data stored in a tabular format, such as CSV files. It’s not domain specific. At the moment, the software is designed primarily to work with neural networks and decision tree models, although it will also work with XGBoost, with plans to support others in the future, Friedland says.
The company is working on an enterprise version of the software that will have a nicer Web interface. Interested parties can experiment with the tool on small data sets through the company’s website at www.brainome.ai.