How Spark Drives Midsize Data Transformation for Trifacta
Trifacta was founded several years ago with the goal of automating the task of cleansing and transforming big data, and the combination of Hadoop and MapReduce was a good match. As customers adopted the product, they realized they needed to transform smaller data sets as well, and that led Trifacta to make several changes in version 2, including adopting the in-memory Spark engine.
The first version of Trifacta’s Data Transformation Platform supported two basic modes of operation. A browser-based client lets a user explore their data and begin transforming it manually. Machine learning algorithms built into the software then use those browser sessions to build transformation scripts that are executed as MapReduce jobs over big data sets. This approach gives a 10x multiplier to the amount of data an analyst can process, the company says.
But according to Trifacta’s CTO Sean Kandel, relying on the two modes left a significant gap between the small amount of data that could be loaded into memory on the PC and the massive amount of data that a MapReduce job could work with.
“What we found is there are a number of use cases where the data doesn’t fit in a browser, but you want interactivity over a fairly large data set, so we’ve extended our support for execution engines by supporting technologies like Spark and supporting Parquet as well,” Kandel says.
The focus with Trifacta Data Transformation Platform version 2, which was launched today, is on handling a greater variety of data types and sizes. Support for the Parquet, Avro, and JSON formats helps with the former, while the adoption of Apache Spark as a third execution engine helps with the latter.
“For data that doesn’t quite fit in browser, you still want that interactive experience,” Kandel says. “By leveraging in-memory processing [in Spark] it allows you to get much more interactive rates than you would normally get by pushing into a batch system like MapReduce. Also, by maintaining the data in memory, you can run a lot of interactive algorithms over it, which is good for us in terms of how we do data profiling and for driving predictive interaction.”
Some Trifacta customers, such as Lockheed Martin and LinkedIn, are using the tool to transform massive data sets that are measured in the terabytes and petabytes. But the company also has its share of customers who need a way to transform lots of smaller data sets that perhaps are measured in the megabytes or gigabytes. While some companies have a small number of massive data sets, others are trying to juggle and join 10 or more sets of data that are smaller in size, and Trifacta is being called on to help.
“Initially Trifacta made a commitment to focus on the high end where you’re seeing terabytes to petabytes being transformed,” says Stephanie McReynolds, vice president of marketing for Trifacta. “What we’ve seen in the market is there’s also a healthy set of requirements to process midsize and small data.”
In particular, Trifacta is keen on pushing transformation scripts written in Wrangle (its declarative scripting language for data transformations) down into Spark’s Resilient Distributed Datasets (RDDs) to take advantage of Spark’s in-memory processing capability.
“We now feel, with version 2, we have complete coverage of the entire spectrum, from small data to big data use cases, in the highest performant way by translating down to the right execution engine the transformation scripts that folks want to run, and do that in an automated way,” McReynolds says.
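The idea of routing a script to the right engine can be illustrated with a short Python sketch. This is not Trifacta’s implementation, which the company has not described in detail. The `Engine` enum, the `choose_engine` function, and both size cutoffs are hypothetical, chosen only to show how a browser, Spark, and MapReduce tier might be separated by data size.

```python
from enum import Enum


class Engine(Enum):
    """The three execution engines named in the article."""
    BROWSER = "browser"
    SPARK = "spark"
    MAPREDUCE = "mapreduce"


# Hypothetical cutoffs, for illustration only. The article gives no figures.
BROWSER_MAX_BYTES = 100 * 1024**2   # assume ~100 MB fits in a browser session
SPARK_MAX_BYTES = 500 * 1024**3     # assume ~500 GB fits in cluster memory


def choose_engine(size_bytes: int,
                  browser_max: int = BROWSER_MAX_BYTES,
                  spark_max: int = SPARK_MAX_BYTES) -> Engine:
    """Pick the smallest engine whose assumed capacity covers the data set."""
    if size_bytes < 0:
        raise ValueError("size_bytes must be non-negative")
    if size_bytes <= browser_max:
        return Engine.BROWSER
    if size_bytes <= spark_max:
        return Engine.SPARK
    return Engine.MAPREDUCE


if __name__ == "__main__":
    for size in (10 * 1024**2, 20 * 1024**3, 5 * 1024**4):
        print(size, choose_engine(size).value)
```

A real dispatcher would likely weigh more than raw size, such as available cluster memory and the operations in the script, but a size tier is the simplest way to picture the midsize gap that Spark is meant to fill.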
Version 2 also adds a new visual data profiling capability, which aids data analysts in their initial understanding of new data sets. “So as soon as you get your hands on a new data set or data source,” Kandel says, “you can immediately start understanding the types of data that are available to you, if there are data quality issues like errors or outliers, and more generally the shape and structure of that data so it can give you a head start in understanding if you need to transform your data and how you need to do it.”
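For readers who want a concrete picture of what profiling a column involves, here is a minimal Python sketch that counts inferred value types and flags numeric outliers. It is an assumption-laden illustration, not Trifacta’s profiler: the `infer_type` and `profile_column` functions are hypothetical, and the outlier test (a modified z-score based on the median absolute deviation, with a 3.5 cutoff) is one common technique swapped in for the example.

```python
import statistics
from collections import Counter


def infer_type(value):
    """Classify a raw string as missing, integer, float, or string."""
    if value is None or value.strip() == "":
        return "missing"
    try:
        int(value)
        return "integer"
    except ValueError:
        pass
    try:
        float(value)
        return "float"
    except ValueError:
        return "string"


def profile_column(values, cutoff=3.5):
    """Summarize a column: counts per inferred type plus numeric outliers.

    Outliers are found with a modified z-score built on the median and the
    median absolute deviation (MAD). This choice is an assumption made for
    the sketch, not a description of any vendor's method.
    """
    type_counts = Counter(infer_type(v) for v in values)
    numbers = [float(v) for v in values
               if infer_type(v) in ("integer", "float")]

    outliers = []
    if numbers:
        median = statistics.median(numbers)
        mad = statistics.median(abs(x - median) for x in numbers)
        if mad > 0:  # with MAD of zero the score is undefined, so skip
            outliers = [x for x in numbers
                        if 0.6745 * abs(x - median) / mad > cutoff]

    return {"types": dict(type_counts), "outliers": outliers}


if __name__ == "__main__":
    column = ["10", "12", "11", "13", "12", "11", "900", "", "abc"]
    print(profile_column(column))
```

Even a summary this small surfaces the kinds of issues Kandel describes: a stray text value in a numeric column, an empty cell, and one number far outside the range of the rest.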
Support for Parquet, a compressible columnar data format for Hadoop that’s used extensively in the Impala engine, will make it easier for Trifacta customers to integrate with the various analytic tools that use Parquet to access the Impala data store, including the visualization tools from Tableau Software.
In other news, several of Trifacta’s customers, including Autodesk, LinkedIn, MarketShare and Orange Silicon Valley, will be discussing their use of the company’s products at the Strata + Hadoop World conference next week. We’ll do our best to bring you their stories.