Trifacta Gets $12M to Refine Raw Data
One of the most pressing problems affecting big data analysis projects is the difficulty in getting clean data to work with. Whether you’re planning to use MapReduce or a fancy machine learning algorithm to glean insights from your data, you’re not going to get very far into your project if the data is messy, unclean, or incomplete. Data transformation startup Trifacta just got $12 million to build a product that addresses this problem.
Trifacta co-founder and CEO Joe Hellerstein says there’s nothing wrong with the current crop of ETL tools, such as Informatica’s industry-leading product–so long as you don’t mind spending lots of time and money to prepare a relatively small amount of data for analysis. The products and the processes simply don’t scale to deal with today’s huge data volumes, he says.
“People spend the most amount of time at the data transformation stage, where you’re taking raw data just acquired and trying to manipulate it into a form that you can actually do work on,” Hellerstein tells Datanami. “This is a critical bottleneck in many organizations for what data analysts do and it’s a barrier for many business analysts for surfacing their own data and being able to take advantage of the plethora of data today.”
Trifacta cites a statistic from Gartner that doesn’t bode well for the future of big data analysis. By the end of 2015, Gartner says, 85 percent of Fortune 500 companies will be unable to exploit big data’s advantages. The problem is not a lack of technology, but the large amount of manual data massaging that humans must do. Data scientists spend up to 80 percent of their time transforming data (including evaluating, structuring, and cleansing it), which leaves little time for actually analyzing the data, Gartner says.
The high rate of today’s big data flows–whether structured or unstructured–and the rapid pace of business decisions based on these data flows require a different approach to transformation. This new approach combines a more intuitive user interface with new data algorithms, according to Hellerstein, who’s a computer science professor at the University of California, Berkeley. That, in essence, is the product that Trifacta is in the process of building.
“The design challenge is how do you bring those two things together in an interface where the user is able to see things, to apply intuition, and interact in a very lightweight way with the data to flesh things out,” he says. “We have a strong philosophical belief that the way to get past this enormous pain of working with data right now involves new algorithms and also involves new user experience design. Those things have to be co-designed together. There’s not one algorithm that does this automatically. There’s not one great visualization that does this for you. It’s kind of the marriage of these…technologies, along with scalability.”
Hellerstein, who is on leave from his job at Cal to build the company, is joined at Trifacta by two other co-founders: CTO Sean Kandel, a Ph.D. who hails from the same Stanford data visualization program that yielded the VizQL technology behind Tableau (a partner), and chief experience officer (CXO) Jeffrey Heer, a comp-sci professor at the University of Washington who has won awards for his work developing novel user interfaces for exploring, managing, and communicating data.
Trifacta plans to use the $12 million investment to help bring the as-yet-unnamed product to market (expected in the first part of 2014), as well as to ramp up sales and marketing. The Series B financing round was led by Greylock Partners and Accel Partners, and brings the San Francisco-based company’s total funding to $16.3 million.
In addition to the venture capital, Trifacta has the early backing of big data software leaders Tableau and Cloudera (Hellerstein went to graduate school with Cloudera co-founder Mike Olson). Tableau is great at creating visualizations of big data sets, Hellerstein says, but often there’s a lot of prep work required before the data can be fully exploited in Tableau.
(Photo caption: Trifacta CEO and co-founder Joe Hellerstein)
“If you don’t have data fit for analysis–if it’s not fully structured or it hasn’t had its value coded in the way you want or it needs additional data or it needs to be reduced–those are all tasks that happen before you load into Tableau,” Hellerstein says. “They look to us to bring more data to the table for their customers to analyze. They see their customers reaching into Trifacta to try to bring data that otherwise you really couldn’t work with [in] Tableau.”
“Having clean, standardized data goes a long way toward helping people make sense of it,” chief development officer and Tableau co-founder Chris Stolte says in a press release. Olson, meanwhile, says Trifacta’s new interaction technology will “allow analysts to transform data at scale, efficiently, making the EDH [enterprise data hub] an even better home for data that matters to the business.” Jeff Hammerbacher, a Cloudera co-founder and chief scientist at the company, says “Trifacta’s software makes data transformation more visual, interactive, and efficient.”
The Trifacta product will be a Web-based application that sits in line with the data flow as part of the overall data workflow. It takes raw data as input, lets a user identify patterns in the data through various visualizations, then automatically generates code that executes the data transformation task against the entire data flow at scale, before outputting the data into Hadoop, an enterprise data warehouse, a stream-processing engine, or another destination.
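To make that workflow concrete, here is a minimal Python sketch of the general idea: transformation steps chosen while looking at a small sample are compiled into one function that then runs over the full flow. This is an illustration only, not Trifacta's code or design. Every name in it (`split_column`, `recode`, `compile_steps`) and the sample records are hypothetical.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Optional

# Hypothetical types for this sketch: a row is a plain dict, and a step
# returns a new row, or None to drop a row that does not fit the pattern.
Row = Dict[str, str]
Step = Callable[[Row], Optional[Row]]


def split_column(source: str, sep: str, targets: List[str]) -> Step:
    """Structuring step: split one raw field into several named fields."""
    def step(row: Row) -> Optional[Row]:
        parts = row.get(source, "").split(sep)
        if len(parts) != len(targets):
            return None  # row does not match the pattern seen in the sample
        out = {k: v for k, v in row.items() if k != source}
        out.update(zip(targets, (p.strip() for p in parts)))
        return out
    return step


def recode(column: str, mapping: Dict[str, str]) -> Step:
    """Cleansing step: map inconsistent codes onto one standard value."""
    def step(row: Row) -> Optional[Row]:
        value = row[column]
        return {**row, column: mapping.get(value.lower(), value)}
    return step


def compile_steps(steps: List[Step]) -> Callable[[Iterable[Row]], Iterator[Row]]:
    """Turn the recorded steps into one function over a stream of rows."""
    def run(rows: Iterable[Row]) -> Iterator[Row]:
        for row in rows:
            current: Optional[Row] = row
            for step in steps:
                current = step(current)
                if current is None:
                    break
            if current is not None:
                yield current
    return run


# Steps a user might settle on while exploring a small sample.
transform = compile_steps([
    split_column("raw", "|", ["name", "state"]),
    recode("state", {"calif.": "CA", "california": "CA"}),
])

# The same compiled function then runs over the full flow, one row at a time.
full_flow = [
    {"raw": "Ann | Calif."},
    {"raw": "Bob | CA"},
    {"raw": "malformed line"},
]
cleaned = list(transform(full_flow))
```

Because `run` is a generator, rows are processed one at a time rather than loaded into memory together, which is one simple way a transformation defined on a sample could be applied to a much larger flow. A real system would presumably generate code for a distributed engine instead.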
Hellerstein envisions the software being used by data scientists, data analysts, business analysts, and also IT professionals whose sole job is performing data transformation (probably with one of the legacy tools that Hellerstein hopes to displace).
Trifacta is not the only company aiming to address the big problems surrounding big data transformation. Paxata launched at Strata this fall with $8 million in Series B financing, as well as a partnership with Tableau. Alteryx is also valued by Tableau for its capability to primp and prune data prior to analysis. Pentaho and Talend are also looking to bring big data transformation solutions from the open source world. And don’t count out Informatica or the ability of new startups to ramp up quickly with the backing of VC firms eager to ride the big data gravy train.
Hellerstein expects that, as the big data phenomenon continues its incredible run, organizations will be looking to software like Trifacta’s to replace or augment the data science skills that are in such short supply. “The joke was that I should solve the [data scientist shortage] problem by teaching more students at Berkeley,” Hellerstein says. “But that just doesn’t scale. Software has to come in and take away some of the repetitive and mechanizable tasks. Software is good at some stuff that people aren’t.”