October 9, 2014

How Spark Drives Midsize Data Transformation for Trifacta

Alex Woodie

Trifacta was founded several years ago with the goal of automating the task of cleansing and transforming big data, and the combination of Hadoop and MapReduce was a good match. As customers adopted the product, they realized they needed to transform smaller data sets as well, and that led Trifacta to make several changes in version 2, including adopting the in-memory Spark engine.

The first version of Trifacta’s Data Transformation Platform supported two basic modes of operation. A browser-based client allowed a user to explore the data and begin transforming it manually. Then, machine learning algorithms built into the software used the browser sessions to build transformation scripts that were executed as MapReduce jobs over big data sets. This approach gives a 10x multiplier to the amount of data an analyst can process, the company says.

But according to Trifacta’s CTO Sean Kandel, relying on the two modes left a significant gap between the small amount of data that could be loaded into memory on the PC and the massive amount of data that a MapReduce job could work with.

“What we found is there are a number of use cases where the data doesn’t fit in a browser, but you want interactivity over a fairly large data set, so we’ve extended our support for execution engines by supporting technologies like Spark and supporting Parquet as well,” Kandel says.

The focus with Trifacta Data Transformation Platform version 2, which was launched today, is on handling a greater variety of data types and sizes. Support for the Parquet, Avro, and JSON formats helps with the former, while the adoption of Apache Spark as a third execution engine helps with the latter.

“For data that doesn’t quite fit in browser, you still want that interactive experience,” Kandel says. “By leveraging in-memory processing [in Spark] it allows you to get much more interactive rates than you would normally get by pushing into a batch system like MapReduce. Also, by maintaining the data in memory, you can run a lot of interactive algorithms over it, which is good for us in terms of how we do data profiling and for driving predictive interaction.”

Some Trifacta customers, such as Lockheed Martin and LinkedIn, are using the tool to transform massive data sets that are measured in the terabytes and petabytes. But the company also has its share of customers who need a way to transform lots of smaller data sets that perhaps are measured in the megabytes or gigabytes. While some companies have a small number of massive data sets, others are trying to juggle and join 10 or more sets of data that are smaller in size, and Trifacta is being called on to help.

“Initially Trifacta made a commitment to focus on the high end where you’re seeing terabytes to petabytes being transformed,” says Stephanie McReynolds, vice president of marketing for Trifacta. “What we’ve seen in the market is there’s also a healthy set of requirements to process midsize and small data.”

In particular, Trifacta is keen on pushing transformation scripts written in Wrangle (its declarative scripting language for data transformations) down into Spark’s Resilient Distributed Datasets (RDDs) to take advantage of Spark’s in-memory processing capability.

“We now feel, with version 2, we have complete coverage of the entire spectrum, from small data to big data use cases, in the highest performant way by translating down to the right execution engine the transformation scripts that folks want to run, and do that in an automated way,” McReynolds says.
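The idea of routing a transformation to the engine that best fits the size of the data can be illustrated with a minimal sketch. The size cutoffs, the function name, and the engine labels below are hypothetical and chosen only for illustration; the company has not said how the platform actually makes this choice.

```python
from enum import Enum


class Engine(Enum):
    """The three execution engines named in the article."""
    BROWSER = "browser"
    SPARK = "spark"
    MAPREDUCE = "mapreduce"


# Hypothetical cutoffs for illustration only; not Trifacta's actual values.
BROWSER_LIMIT_BYTES = 100 * 1024**2   # about 100 MB: fits in a browser session
SPARK_LIMIT_BYTES = 500 * 1024**3     # about 500 GB: assumed to fit in cluster memory


def choose_engine(size_bytes: int) -> Engine:
    """Pick an execution engine from the size of the input data set."""
    if size_bytes < 0:
        raise ValueError("size_bytes must be non-negative")
    if size_bytes <= BROWSER_LIMIT_BYTES:
        return Engine.BROWSER
    if size_bytes <= SPARK_LIMIT_BYTES:
        return Engine.SPARK
    return Engine.MAPREDUCE
```

Under these assumed cutoffs, a 5 GB data set would be routed to Spark, while a multi-terabyte data set would fall through to MapReduce.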

Version 2 also adds a new visual data profiling capability, which aids data analysts in their initial understanding of new data sets. “So as soon as you get your hands on a new data set or data source,” Kandel says, “you can immediately start understanding the types of data that are available to you, if there are data quality issues like errors or outliers, and more generally the shape and structure of that data so it can give you a head start in understanding if you need to transform your data and how you need to do it.”

Support for Parquet, a compressible columnar data format for Hadoop that’s used extensively in the Impala engine, will make it easier for Trifacta customers to integrate with the various analytic tools that use Parquet to access the Impala data store, including the visualization tools from Tableau Software.

In other news, several of Trifacta’s customers, including Autodesk, LinkedIn, MarketShare and Orange Silicon Valley, will be discussing their use of the company’s products at the Strata + Hadoop World conference next week. We’ll do our best to bring you their stories.

Related Items:

Has Dirty Data Met Its Match?

Trifacta Gets $12M to Refine Raw Data

Automating the Pain Out of Big Data Transformation

Applications: Enterprise Analytics, Visualization

Technologies: Middleware

Sectors: Financial Services, Healthcare, Manufacturing, Retail

Vendors: Tableau, Trifacta

Tags: data transformation, midsize data
