November 4, 2020

Brainome Right-Sizes Your Data Before ML Training

Alex Woodie


A startup called Brainome today launched a new product designed to help data scientists determine how much data they need to sufficiently train their machine learning models. In addition to cutting costs, the software can also help data scientists avoid overfitting their models.

Called Daimensions, the Python-based tool essentially works as a compiler that generates the “memory equivalent capacity” of one’s data. Based on this figure, data scientists can determine whether there’s enough data to extract a meaningful signal. The tool also tells users about the capability to generalize from the data, and helps them with feature selection.

“We’re doing measurements on the data set in the context of the model you want to build,” says Bertrand Irissou, Co-Founder and CEO at Brainome. “We’re going to tell you, do you have enough data? Or do you have too much data? That’s an actual possibility.”

For example, given the progression of numbers “2, 4, 6, 8, 10,” it’s pretty obvious what the pattern is, and what the next number will be, Irissou says. There is no need to take the progression out into the millions, because it will only add costs for gathering, storing, and processing data.

“If I tell you ‘6, 4, 9, 8,’ the next number is not so obvious, and probably shouldn’t be, because they are the last four digits of my phone number and they’re supposed to be random,” Irissou tells Datanami. “So being able to detect from the get-go whether or not there is information in the data is actually very important.”
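Irissou’s two sequences can be turned into a toy check. The sketch below is only an illustration of the idea that some data carries an obvious rule and some does not; it is not Brainome’s method, and the function names `constant_step` and `predict_next` are made up for this example. It tests whether a sequence has a constant step and, if so, predicts the next value.

```python
def constant_step(seq):
    """Return the common difference if seq is arithmetic, else None."""
    # Collect the distinct gaps between neighboring values.
    diffs = {b - a for a, b in zip(seq, seq[1:])}
    return diffs.pop() if len(diffs) == 1 else None


def predict_next(seq):
    """Predict the next value when a constant step exists, else None."""
    step = constant_step(seq)
    return None if step is None else seq[-1] + step


print(predict_next([2, 4, 6, 8, 10]))  # 12
print(predict_next([6, 4, 9, 8]))      # None
```

Five values are enough to pin down the rule in the first sequence, so extending it into the millions would add cost without adding information. The second sequence shows no constant step, so this simple check declines to predict.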

Daimensions is based on a realization by Brainome co-founder and CTO Gerald Friedland, who is a data scientist and a professor at UC Berkeley. Friedland was working on a data science project with physicists, who became concerned about what happened to their carefully collected data when Friedland loaded them as parameters into a black box machine learning model.

“They said, you lose all the significance [of the data] that we created over hundreds of years,” Friedland says. “They were right.”


Rather than take the brute-force approach that is typical of deep learning today, Friedland sought to identify significant patterns in the data before building the model and, based on those patterns, take a more targeted approach to building a more meaningful (and smaller) machine learning model.

Inspired by David MacKay’s 2003 book, “Information Theory, Inference, and Learning Algorithms,” Friedland started building a framework that implemented this approach, which he documented in his 2018 paper “A Practical Approach to Sizing Neural Networks.” He teamed up with Irissou to launch San Francisco-based Brainome, which is now coming to market with its first product.

In addition to cutting costs related to cloud data storage and GPU runtime, Brainome hopes its software can help save another precious commodity in short supply in the data science world: time.

For example, say you wanted to build a bridge. A good civil engineer would approach that challenge by first methodically measuring the most important parameters for that bridge, including its length, width, wind shear, etc., Irissou says.

“Then you figure out the correct structure you’re going to use for that particular bridge, how much material you need, how much time you’re going to need,” he says. “By the time you drive your first truck over it, you know your bridge isn’t going to collapse.

“Machine learning today is kind of like building 100 bridges at a time and figuring out which one is going to collapse,” he continues. “What we’re doing in a fundamentally different way is taking the guesswork out of figuring out what is your need for data quantity, what is the feature engineering process that you’re going to use, and really empower the data scientist to reduce their experimentation cycles from weeks to just literally a few hours.”

Brainome is already working with several firms, including Cedars-Sinai Medical Center. The Los Angeles, California, hospital is using the company’s software in a study of the impact of genes in cancer. The data at play is massive, with over 20,000 genes in the mix. Cedars-Sinai uses Brainome at the front end to figure out which of the 20,000 genes are predictive of cancer.

In Cedars-Sinai’s case, trying to figure out which features are important after the model has already been trained is a losing battle, Irissou says. “If you use a brute-force approach, it would be two to the power of N, N being the number of features,” he says. “Two to the power of 20,000 is just an astronomical number that would require an infinite amount of power to calculate.”
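To put a size on that figure, the arithmetic can be sketched in a few lines of Python. The helper name `subset_count_digits` is hypothetical; it counts the decimal digits of 2^N using logarithms rather than printing the number itself.

```python
import math


def subset_count_digits(n_features):
    """Number of decimal digits in 2**n_features.

    Each feature is either in or out of a subset, so N features
    give 2**N possible subsets for a brute-force search.
    """
    return math.floor(n_features * math.log10(2)) + 1


print(subset_count_digits(10))      # 4, since 2**10 = 1024
print(subset_count_digits(20_000))  # 6021
```

The count of possible feature subsets for 20,000 genes is finite, but it is a number more than 6,000 digits long, which is why checking every subset is out of reach.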

Another early adopter of Brainome is SK telecom, the South Korean telecommunications company. Eric Davis, vice president of SK telecom’s AI Language Tech Labs, says Brainome “has been a breath of fresh air” in helping the company tackle a problem in the healthcare domain.

“Our previous approach was time consuming, full of guesswork, and took over a week to iterate from feature extraction to experimentation to results,” Davis says in a press release published today. “Brainome took a lot of the guesswork out of data quantity needs and feature importance, allowing us to reduce our experimentation cycle from a week to mere hours. Equally as important, the easy-to-deploy Python model allowed us to spend more time on experimentation versus serving and deploying our model.”

By figuring out the importance of one’s data before building the machine learning model, the resulting size of the model can be shrunk significantly without giving up accuracy, Irissou says. It also helps to keep the model from memorizing the data, to generalize better, and to avoid overfitting.

“The point of measuring is you can figure out the complexity beforehand. Most models are grossly oversized. They may generalize, but the assumption we [have is] you want models that are as small as possible,” Irissou says. “We do it to sell software. We’re not here to sell GPU hours in cloud. There’s very little incentive in the current ecosystem to say, oh let me give you a solution that gets you 1/100th of the amount of GPU that you’re currently using and it’s going to give you models so small that you can run them on regular CPU.”

Brainome’s software is sold in a Docker container that can run in the cloud or on prem. The software works with structured data stored in a tabular format, such as CSVs. It’s not domain specific. At the moment, the software is designed primarily to work with neural networks and decision tree models, although it will also work with XGBoost, and the company plans to support others in the future, Friedland says.

The company is working on an enterprise version of the software that will have a nicer Web interface. Interested parties can experiment with the tool on small data sets through the company’s website at www.brainome.ai.


Technologies: Cloud, Frameworks, Middleware

Sectors: Biosciences, Financial Services, Healthcare, Science, Telecommunications

Vendors: Brainome

Tags: Bertrand Irissou, Daimensions, generalization, Gerald Friedland, machine learning, overfitting
