November 8, 2022

What’s Hot in the Data Preparation Market: A Look at Tools and Trends

Armon Petrossian


With economic uncertainties on the horizon, it’s an opportune time for businesses to start looking at how they can automate manual processes and remove bottlenecks within their organizations. The data dams have opened—IDC predicts there will be 175 zettabytes of data worldwide by 2025—and even a struggling economy is unlikely to slow the growth of organizations generating and using data.

As budgets tighten across industries, businesses that successfully optimize operations and find efficiencies in their pipelines (product, data, or otherwise) are setting themselves up not only to survive but to thrive in a more challenging economic environment. For data-driven organizations, one field ripe for disruption and optimization is data preparation tools.

Most data preparation tools can be broken into two camps: GUI-based tools and code-first tools. GUI-based tools are easy to use and can be a great option for smaller businesses that don’t require the ability to do data transformations at scale. However, these tools can’t be integrated with source control systems, making it difficult to implement software best practices.

Code-first tools have what it takes to keep up with enterprise scale and offer ample flexibility. But this perk is a double-edged sword; in order to use a purely code-first tool, the user must know how to write and manage complex code. It can also be incredibly inefficient to code everything at an enterprise scale. Realistically, most enterprises need a balance of these two things—the ease-of-use of a GUI-based tool (not everyone is a software engineer, after all) and the scale and flexibility that code-first tools provide.

Let’s take a look at the different types of tools and trends playing out in this space currently.

GUI-based Tools


GUI-based tools are a solid solution for self-service business users. They don’t require any code, so anyone can use them, and they let users quickly build things and prepare data. There is a small subset of companies for which GUI-based tools meet all of the business’s needs. However, these tools are lacking in flexibility and can’t support data transformations at enterprise scale. Because of this, many companies have turned to code-first tools, but an everything-as-code approach is not a silver bullet.

Code-first Tools

In the past few years, there’s been a trend of appropriating software engineering best practices and applying them to analytics. By bringing software engineering processes to data, these code-first tools can support massive scale and flexibility. But despite the benefits of an everything-as-code approach, there are a few drawbacks.

First, most enterprises don’t have enough people who can write SQL and who understand how to apply software engineering principles to analytics. Second, most business users or analysts do not understand the importance of data architecture. Certain companies might be able to pull this off by hiring very technical people in every department, but this isn’t feasible for the vast majority of businesses. Additionally, there can be technical limitations to a code-first approach that cause some companies to scrap using a tool altogether and instead write their own code in their data warehousing platform.

A Balanced Approach

Ideally, modern enterprises need a tool that is GUI-driven so that anyone can use it, but also supports code for exceptional flexibility. New tools that feature a combination of both GUI and code elements have emerged to fit this need, allowing users to build more efficiently, with better governance, and less manual coding.

The right data transformation approach is an integral part of an organization’s data stack, as it sets the foundation for implementing key data strategies, including a renewed recognition of the importance of data modeling, as well as democratizing data access through a data mesh.

Trend #1: Data Modeling

Data modeling is a vital, often overlooked step in building a data warehouse. Simply put, it gives the user a high-level overview of what they’re trying to build with data prior to executing on it. Think of it in terms of construction; if you’re building a shed in your backyard, you probably don’t need a blueprint. But if you’re constructing a skyscraper, it would be absurd to start building without a plan in place.

The rise of code-first tools in combination with the cloud (which let users build things quickly with little analytics experience) has caused companies to lose sight of what they’re building. Imagine giving an enormous plot of land to someone who only knows how to build sheds; they’ll start building sheds on top of sheds. In the data warehouse, this translates into building things with no regard for how to manage the architecture at scale, or for the relationships between different use cases (i.e., the ‘sheds’).

The industry has started to recognize the shortcomings of code-first tools and is pivoting toward a more balanced approach that addresses proper data modeling and the physical aspects of building data products or warehouses.

Trend #2: Data Mesh

Until recently, building a data warehouse has taken a centralized approach that starts and ends with a company’s IT team. But the IT team doesn’t have a comprehensive understanding of how the data they’re working with was generated, because it was generated by the business side of the house.

Data mesh aims to decentralize data ownership by breaking down the silos between IT and the business. This new approach allows businesses to create their own data pipelines, for example, while giving IT visibility into what’s happening so that they can properly govern it. This paradigm hasn’t been possible until now due to a lack of tools, but expect to see more of an emphasis on data mesh in the coming months and years.

The data preparation landscape is undergoing notable change as trends like data mesh and data modeling become more widely adopted. At the same time, many companies and influencers within the space have started to take a closer look at the downsides of relying solely on a GUI-based or a code-first tool, instead opting for a more balanced solution that incorporates the best elements of both.

About the author: As co-founder and CEO, Armon Petrossian created Coalesce, the only data transformation tool built for scale. Previously, Armon was part of the founding team at WhereScape, a leading provider of data automation software. At WhereScape, Armon served as national sales manager for almost a decade.
