September 22, 2015

Trifacta Seeks Truce Between Data Wranglers and IT Chieftains

Trifacta today unveiled an updated version of its big data transformation software that should make it easier for data wranglers to adhere to the data management and security requirements of the corporate IT department.

The big data revolution has exacerbated the decades-long war that’s been waged between individual business units that make decisions, and the IT departments that maintain the systems that increasingly help make those decisions.

When it comes to cleansing and managing the data that feed today’s increasingly complex analytic systems, there have been multiple points of view about how best to go about that. To crudely summarize: the business units want independence and the agility to analyze any data as they see fit, while the corporate IT departments prefer standardization, homogenization, and strict adherence to process.

Trifacta has found itself in the unenviable intersection of these camps ever since it set out to build an automated data transformation solution for today’s popular big data platforms, namely Hadoop but also enterprise data warehouses. The company’s motto has been all about putting control over data transformation and cleansing decisions into the hands of the analysts who serve the business units. They are best equipped to say how messy raw data should look before being added into the official analytics systems, Trifacta says.

While Trifacta has achieved its early goals and delivered some degree of automation with its tools, it hasn’t appeased both sides. And so the company has spent much of the past year building data governance facilities into its software to ensure that the IT department’s needs for standardization and adherence to process are being met.

Adam Wilson, the CEO of Trifacta, has a front-row seat to this latest back-and-forth over how best to do data governance and maintain security in a big data world.

“The idea that there’s going to be an intergalactic-schema that’s defined and anticipates all the business needs for data and pre-defines the metadata and definitions and is handed down from the high priest and given to the masses, unfortunately is too brittle and has not proven to be effective in most scenarios,” Wilson tells Datanami.

However, the anything-goes, Wild West gun show in which everybody works independently and accesses Hadoop on a self-service basis (which is perhaps how the pendulum has swung lately) also cannot fly in the long run, Wilson admits.

“Then everybody talked about self-service without thinking through the data governance aspect of things–that also is not the right answer,” he continues. “I think that people are hungry for something that allows them to get access at data more readily, and allows them to have a data governance approach that’s more participatory.”

Triangulating a Truce

These concerns drove a lot of the development of version 3 of Trifacta’s eponymous product. Whereas earlier versions helped analysts get cleaner data in a better and faster way, the new version is aimed at ensuring that those processes don’t run afoul of the emerging data management requirements of the IT department.

To that end, Trifacta version 3 brings support for Hadoop security tools, including Kerberos for user authentication. It also adds support for Apache Sentry, the Cloudera-sponsored project that governs what users can access in Hadoop, as well as for Apache Ranger, the Hortonworks-sponsored project that provides centralized security administration for Hadoop clusters.

“It’s pertinent for us to make sure we can integrate to those security standards, so we’ve added support in the product,” says Wei Zheng, Trifacta’s vice president of products. “From an IT perspective, there’s a lot of concern around infrastructure and manageability. This is a signal to me that the ecosystem is maturing, and if you look at Hadoop implementation partners like Cloudera and Hortonworks, their last two releases have been very focused on enterprise standards around security and governance.”

Version 3.0 also includes provisions for improved metadata tracking in Hadoop, which is another potential thorn in the IT department’s side. Trifacta’s customers can now see how the transformations ran from within the Cloudera Navigator product.

“So you’ll be able to see wrangle scripts in their full form within the Navigator product,” Zheng says. She added that Trifacta has a follow-up roadmap item to read data from Navigator as well, and is working in a similar way with Hortonworks and its Atlas initiative.

Trifacta is in talks with numerous Fortune 500 companies that are trying to get a handle on their big data transformation challenges. This includes companies that take security quite seriously, like banks and healthcare companies. But large enterprises in general tend to be fairly strict about what data their users can access. Fitting into these existing constructs—not fighting them like a solitary data wrangler out on the big data range—is the best path toward customer productivity and increased license sales.

“As soon as we talk about deploying it to their user base, we get questions about security controls, and how can we ensure users can get access to their data but that they’re not accessing someone else’s data or seeing what they aren’t supposed to see?” Zheng says.

Version 3 does bring some functional enhancements that data wranglers will appreciate, including a new facility called Transformation Builder that makes it easier for users to manually develop transformation scripts (Trifacta does most of the code-generating automatically). It also offers new data connectors for Amazon S3 and Redshift, and for Apache Hive. V3 also introduces a multi-split transformation, which will automatically split complex records with more than one type of delimiter.
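To make the multi-split idea concrete, here is a minimal, hypothetical sketch in Python (not Trifacta’s own code or interface) of splitting a raw record on several delimiter types in one pass:

    import re

    # A raw record that mixes three delimiter types: pipe, semicolon, and comma.
    record = "2015-09-22|user=alice;action=login,status=ok"

    # Split on any of the delimiters in a single operation.
    fields = re.split(r"[|;,]", record)

    print(fields)
    # ['2015-09-22', 'user=alice', 'action=login', 'status=ok']

In a tool like Trifacta the equivalent step would be expressed through the product’s own transformation scripts rather than hand-written code; the snippet only illustrates what a one-pass split across multiple delimiters produces.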

At the end of the day, though, Trifacta version 3 is all about keeping the business analysts and the IT chieftains happy. It’s about toeing the line between enabling the business departments’ agility and not straying too far beyond the bounds set by the IT department.

“If we facilitate that line of communication and collaboration between business and IT, we think that’s a big win for everybody who’s trying to move fast and ultimately be more data driven,” Wilson says.

Related Items:

Six Core Data Wrangling Activities

One Deceptively Simple Secret for Data Lake Success

Taming the Wild Side of Hadoop Data
