March 30, 2016

Architecting Immediacy: The Design of a High-Performance, Portable Wrangling Engine

Seshadri Mahalingam

At Strata + Hadoop World San Jose this week, my colleague Joe Hellerstein, a Trifacta co-founder, and I will present a session entitled “Architecting immediacy: The design of a high-performance, portable wrangling engine.”

A big part of our session will cover our new Photon Compute Framework, an enhancement at the core of Trifacta’s data wrangling interface. Photon is specifically architected to provide Trifacta’s users with a richly interactive and intelligent data wrangling experience on large, in-memory data sets.

Why is this a big deal? 

Today, we’re used to getting feedback fast, and at Trifacta we believe data wrangling shouldn’t be any different: performance is essential to the user experience we are pioneering.

Trifacta delivers immediate feedback and intelligent suggestions every time you interact with it. Data scientists and analysts are never removed from the flow of their work or forced to wait for processing to complete. Photon’s in-memory engine lets users interactively wrangle data volumes orders of magnitude larger than was previously possible, with rich visualizations to assess data quickly. We engineered Photon for speed, with critical in-memory performance features for modern architectures, including multi-threaded parallelism, columnar compression, pipelined data processing and the ability to leverage LLVM for compilation. Yet it requires only a minimal memory footprint.
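To give a rough feel for two of the ideas named above, columnar compression and pipelined processing, here is a toy Python sketch. It is not Photon’s code and says nothing about how Photon is actually implemented; the function names (`dictionary_encode`, `decode`, `uppercase`) and the sample column are hypothetical, chosen only for illustration. The column is stored as a small dictionary of distinct values plus integer codes, and rows are then streamed through a transform one at a time instead of materializing intermediate copies.

```python
from typing import Dict, Iterable, Iterator, List, Tuple


def dictionary_encode(values: List[str]) -> Tuple[List[str], List[int]]:
    """Store a column as its distinct values plus one integer code per row."""
    dictionary: List[str] = []
    index: Dict[str, int] = {}
    codes: List[int] = []
    for value in values:
        if value not in index:
            index[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index[value])
    return dictionary, codes


def decode(dictionary: List[str], codes: Iterable[int]) -> Iterator[str]:
    """Lazily turn codes back into values, one row at a time."""
    for code in codes:
        yield dictionary[code]


def uppercase(rows: Iterable[str]) -> Iterator[str]:
    """A pipelined transform: each row flows through as it is produced."""
    for row in rows:
        yield row.upper()


# Hypothetical sample column with inconsistent casing.
states = ["CA", "ny", "CA", "ca", "ny", "CA"]
dictionary, codes = dictionary_encode(states)

# Decode and transform are chained generators, so no intermediate list is built.
result = list(uppercase(decode(dictionary, codes)))
```

In this toy version the six-row column is held as three distinct strings plus six small integers, and the decode and uppercase steps run as a single pass over the rows.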

The improvements offered by Photon are also a step forward for high-performance interoperability. As part of Photon’s development, Trifacta has been collaborating on the design of Apache Arrow with leading open-source organizations including Cloudera, Databricks, Twitter, MapR and Dremio. Arrow is an open-source in-memory data representation that lets high-performance compute frameworks interchange data at the full speed of modern processors. In addition, Photon snaps into Trifacta’s Intelligent Execution architecture to run side-by-side with more resource-intensive distributed computing frameworks like Spark and MapReduce, which Trifacta supports for big data processing.

Want to learn more?

Trifacta will unveil Photon at Strata + Hadoop World in San Jose. The product will be launched at the “Architecting immediacy” session, during which Joe and I will discuss Photon in greater depth. The talk will highlight pain points endemic to data wrangling, including heavy string manipulation, data profiling and second-order transformations, and will demonstrate how we designed Photon for a fluid, immersive data wrangling experience.

The session is Thursday at 1:50. For more information, click here.

About the author: Seshadri Mahalingam is a senior software engineer at Trifacta and has been with the company since January 2013. In addition to building out wrangle, Trifacta’s domain-specific language for expressing data transformation, he develops the low-latency compute framework that powers Trifacta’s fluid and immersive data wrangling experience. Seshadri holds a B.S. in EECS from U.C. Berkeley, where he co-taught a class on open-source software.

 
