2021: The Year of the Feature Store
Don’t look now, but feature stores (systems for developing, maintaining, and monitoring the data features used by machine learning algorithms for training and inference) are popping up all around us. Amazon Web Services rolled out a feature store last month, Splice Machine unveiled its offering today, and more vendors are expected to join the race.
“I’m expecting 2021 to be the year of the feature store,” says Mike Del Balso, the CEO and co-founder of Tecton, a provider of a cloud-based feature store. Del Balso, who helped build the machine learning system for Google’s ad division before heading to Uber, takes credit for coining the term “feature store,” even if he wishes he had picked a different name.
“When we started Tecton almost two years ago, people didn’t really know what feature stores were,” Del Balso tells Datanami. “It was a thing where we’d get quizzical looks when we talked about feature stores. And now, people are reaching out to us all the time, saying ‘I need a feature store. I know exactly what it is. And I’m at the point now where we just need to put that in our stack tomorrow.’”
Feature stores help to automate critical data engineering tasks that are required to run machine learning models in production. The feature store is the part of the data pipeline that’s responsible for transforming raw data into a more structured form that can be used by the machine learning system.
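To make that transformation step concrete, here is a minimal sketch in plain Python. It is not taken from Tecton or any other vendor mentioned here; the event fields, feature names, and the `build_features` and `get_online_features` helpers are all hypothetical, chosen only to illustrate raw records being turned into per-entity features that training and inference can both read.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical raw events, e.g. rows pulled from an orders table.
RAW_EVENTS = [
    {"user_id": "u1", "amount": 20.0},
    {"user_id": "u1", "amount": 40.0},
    {"user_id": "u2", "amount": 5.0},
]


def build_features(events):
    """Aggregate raw events into one feature row per user."""
    amounts = defaultdict(list)
    for event in events:
        amounts[event["user_id"]].append(event["amount"])
    return {
        user_id: {"order_count": len(vals), "avg_order_amount": mean(vals)}
        for user_id, vals in amounts.items()
    }


# One feature table, computed once, serves both training and inference.
FEATURES = build_features(RAW_EVENTS)


def get_online_features(user_id):
    """Look up the precomputed feature row for a user (None if unseen)."""
    return FEATURES.get(user_id)
```

The point of the sketch is that the feature logic lives in one place, so the values a model sees at inference time are computed the same way as the values it was trained on.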
Many organizations have tried to use ETL products and other tools to develop their own custom feature stores that fit with their machine learning environment. Invariably, these become a morass of digital brittleness that breaks easily, is slow to change, and acts as an anchor weighing down data science teams when they’re hoping to move quickly.
At Uber, Del Balso and his team developed Michelangelo to automate the entire spectrum of machine learning tasks, which also included managing data features.
“We didn’t set off with this grand ambition of having thousands of models in production at the company,” Del Balso says. “We were just partnering with one team at a time, helping them solve their problems. What we realized was that we actually spent most of our time just doing data work, data engineering stuff. It was the least sexy part of anything you can do in machine learning. That was consuming the vast majority of our time.”
Del Balso knew other companies were facing the same problems, so he worked to develop software that could remove the engineering burden of transforming raw data into features that can be plugged into machine learning algorithms.
“The big thing is, it unlocks the data scientists to be able to actually control what machine learning is happening in the production environment,” Del Balso says. “It’s the last missing step to give them real ownership of their work.
“Secondly, it makes all these components reusable,” he continues. “So when the second data scientist is taking on a new data science project, they’re not starting from zero. They have a whole library of already productionized, pre-vetted signals that they can already use in production. It’s like a shopping list: let me try this out, and I can get a model going in no time.”
Tecton may be an early leader in the feature store category, but it has plenty of competition. For instance, Amazon Web Services last month added a feature store to Amazon SageMaker, its popular machine learning development and runtime environment. The AWS feature store will help data scientists define the data features used in SageMaker’s machine learning models, both for training and inference.
Last week, Austin, Texas-based feature store startup Molecula completed a $17.6 million funding round. Higinio Maycotte, the company’s CEO, stated: “The feature store is emerging as the most transformative category in the data space because it automates the preparation of data for machine-scale analytics and AI.”
Other companies have also singled out the feature store as a critical element of the MLOps pipeline. Seattle, Washington-based Kaskada, for instance, launched a feature store a year ago to help data scientists get back to what they do best—iterating with models and data—and leave the large-scale data engineering work to automated tools.
That brings us to Splice Machine, the big data software company that’s centered around its scalable SQL database. The San Francisco company today launched its own feature store aimed at helping customers move past time-consuming manual data preparation steps and automate the creation, management, and serving of data features in production machine learning environments.
“The capacity to create, share, explain and reliably reproduce features for a given model is paramount to the success of a data science team,” Splice CEO Monte Zweben states in a press release. “The old way of doing things meant data science operations were simply not scalable. The Splice Machine Feature Store enables you to harness complex analytics in real time and transform real-time data into features, so your models are never uninformed. It also stores feature history making training set creation a single click.”
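Why stored feature history matters for building training sets can be sketched in a few lines of Python. This is an illustration only, not Splice Machine’s implementation: the `FEATURE_HISTORY` layout and the `value_as_of` and `build_training_set` helpers are hypothetical. The idea shown is a point-in-time lookup, where each training example is paired with the feature value that was current when the labeled event happened, rather than with a later value the model could not have known.

```python
from bisect import bisect_right

# Hypothetical feature history per entity: (timestamp, value) pairs,
# kept sorted by timestamp.
FEATURE_HISTORY = {
    "u1": [(1, 10.0), (5, 12.5), (9, 20.0)],
    "u2": [(3, 7.0)],
}


def value_as_of(entity_id, ts):
    """Return the latest feature value recorded at or before ts, else None."""
    history = FEATURE_HISTORY.get(entity_id, [])
    timestamps = [t for t, _ in history]
    idx = bisect_right(timestamps, ts)
    return history[idx - 1][1] if idx else None


def build_training_set(labels):
    """Join (entity_id, event_ts, label) rows to point-in-time feature values."""
    return [
        {"entity_id": e, "ts": ts, "feature": value_as_of(e, ts), "label": y}
        for e, ts, y in labels
    ]
```

For example, `build_training_set([("u1", 6, 1)])` pairs the label with the value recorded at timestamp 5, not the later one from timestamp 9.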
As software vendors move to claim their spots in the burgeoning market for feature stores, some organizations will inevitably look toward open source alternatives. The leading contender in that department is a project called Feast. A number of companies contribute to Feast, including Tecton and Google Cloud, so expect this project to gain more visibility in the months to come.