June 25, 2019

The 4 Paradigms of Data Prep for Analytics and Machine Learning

Mike White

Data preparation has long been recognized for helping business leaders, analysts, and data scientists ready the data needed for analytics, operations, and regulatory requirements. Today, the technology is becoming even more critical to deriving insights, as most enterprise data is still not ready for use by machine learning (ML) applications and takes significant effort to make usable. In fact, most analytics or data science exercises still require data professionals to spend up to 80 percent of their time on tasks such as ingesting, profiling, cleaning, transforming, combining, and shaping data. While unfortunate, this considerable investment is necessary to ensure that raw data can be converted into reliable and useful information to drive business decisions, support operations, meet regulatory requirements, or predict optimal outcomes.

More recently, data preparation technology has evolved into a valuable tool for creating ML and data science workflows that enhance applications with machine intelligence, enabling the transformation of data into information on demand. The aim is to empower every person, process, and system in the organization to be more intelligent: business users who are closest to the data can prepare datasets quickly and accurately with the help of built-in intelligence and smart algorithms. These users work within an intuitive, visual application to access, explore, shape, collaborate on, and publish data with clicks, not code, under complete governance and security. IT professionals, in turn, are able to maintain the scale of data volumes and variety across both enterprise and cloud data sources to support business scenarios for immediate and repeatable data service needs.

However, not all approaches to data prep are the same, so it is important to understand the following four data prep paradigms before choosing the optimal data prep style for your organization.

Paradigm 1: Workflow vs. Spreadsheet UI

Data practitioners considering a data preparation solution are confronted with many options, but the first step in the process should focus on whether the solution adopts a workflow-oriented user interface or a spreadsheet-like one. Knowing your data persona type (or the skill set of the user base) along with the type and variability of the data at hand will help you determine the ideal user interface paradigm.

A workflow-based interface, also referred to as ETL (Extract, Transform, Load), provides a canvas for placing components or icons, each of which represents a configurable data preparation task, and connecting them with lines that represent dependencies and lineage within the workflow. Because of this abstraction layer, the data content is not viewable until the pre-defined workflow or job is run and the output is browsed. It is important to note that this paradigm assumes the required transformations and joins are known at the time of creation. Multiple iterations and test phases may be needed to assess and validate that the output meets end user requirements.
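To make the idea of components connected by dependency lines more concrete, here is a minimal Python sketch. It is an illustration only, not how any particular product works: the step names and the `workflow` dictionary are hypothetical, and the run order is derived with the standard library's `graphlib.TopologicalSorter`.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each key is a step on the canvas, and each value
# is the set of steps it depends on (the "lines" drawn between components).
workflow = {
    "extract_orders": set(),
    "extract_customers": set(),
    "clean_orders": {"extract_orders"},
    "join": {"clean_orders", "extract_customers"},
    "load": {"join"},
}


def execution_order(graph):
    """Return one valid run order in which every step follows its dependencies."""
    return list(TopologicalSorter(graph).static_order())


def lineage(graph, step):
    """Return every upstream step that feeds into `step`, directly or indirectly."""
    upstream = set()
    stack = list(graph[step])
    while stack:
        node = stack.pop()
        if node not in upstream:
            upstream.add(node)
            stack.extend(graph[node])
    return upstream


order = execution_order(workflow)
print(order)
print(sorted(lineage(workflow, "load")))
```

Note that nothing in this sketch touches the data itself: the graph only describes what will happen when the job runs, which is why the output has to be inspected after execution.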

Alternatively, a spreadsheet-based interface gives its end users a direct view into the data itself and presents each data attribute as a column, often with embedded visual cues for data sparsity, uniqueness, data type mismatch, and other anomalies. This view allows the result of each transformation or step to be seen dynamically throughout the data prep process. Built-in data profiling and data quality issue detection facilitate immediate resolution within the environment. This paradigm inherently reduces the number of iterations and accelerates data prep cycles in which interactive data validation and transformation are critical components of the use case.
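The per-column cues mentioned above (sparsity, uniqueness, type mismatch) can be approximated in a few lines of Python. This is a simplified sketch under stated assumptions: the `profile_column` helper and the sample `ages` column are hypothetical, and missing values are assumed to be `None` or empty strings.

```python
def profile_column(values, expected_type):
    """Summarize sparsity, uniqueness, and type mismatches for one column.

    Assumption: a value counts as missing if it is None or an empty string.
    """
    total = len(values)
    present = [v for v in values if v is not None and v != ""]
    mismatches = [v for v in present if not isinstance(v, expected_type)]
    return {
        # Share of rows with no value at all.
        "sparsity": (total - len(present)) / total if total else 0.0,
        # Share of non-missing values that are distinct.
        "uniqueness": len(set(present)) / len(present) if present else 0.0,
        # Count of non-missing values that are not of the expected type.
        "type_mismatches": len(mismatches),
    }


# Hypothetical column with two missing entries and one mistyped entry.
ages = [34, 41, None, "unknown", 34, ""]
report = profile_column(ages, int)
print(report)
```

A spreadsheet-style tool would recompute a summary like this after every step, so the effect of each transformation on data quality is visible immediately.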

Paradigm 2: Clicks vs. Code-based Approach

With the proliferation of point-and-click, drag-and-drop business intelligence tools, “ease of use” has become a key differentiator when considering data preparation software. However, the code-based approach remains a popular option for technical data users who prefer flexibility and lower software costs compared to purpose-built applications, which tend to be resistant to customization and come with a larger price tag. On the other hand, a code-based approach inherently carries a higher cost in skilled resources and maintenance: every change to the code needs to go through a life cycle of development, test, quality assurance, and production.

Paradigm 3: Sample vs. Full Data Perspective

There are use cases which require a complete data population, such as master data migration, regulatory reporting, and fraud analysis. Likewise, there are use cases which are best performed using a relevant sample or subset of the data, such as predictive analytics and marketing segmentation. The business needs and data characteristics of the use case should ultimately drive the decision when adopting a data preparation solution or approach within the organization.

For instance, a sample-based approach increases the risk of missing some data quality issues, which can have a huge impact depending on the use case. The size and sophistication of the data sample also vary from product to product: some tools enforce a hard-coded sample limit, while others allow you to select your sample size depending on your use case and available processing resources.
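The risk of a sample missing a data quality issue can be roughed out with simple arithmetic. The sketch below is an illustration, not a rule used by any specific tool: it assumes a simple random sample whose records are drawn independently (a reasonable approximation when the sample is small relative to the full dataset), and the function names are hypothetical.

```python
import math


def miss_probability(defect_rate, sample_size):
    """Chance that a random sample contains zero defective records.

    Assumes independent draws, with defect_rate strictly between 0 and 1.
    """
    return (1.0 - defect_rate) ** sample_size


def sample_size_needed(defect_rate, max_miss_probability):
    """Smallest sample size whose miss probability is at or below the threshold."""
    return math.ceil(math.log(max_miss_probability) / math.log(1.0 - defect_rate))


# Hypothetical scenario: 1 record in 1,000 has the issue.
p_miss = miss_probability(0.001, 1000)
n_needed = sample_size_needed(0.001, 0.05)
print(f"Chance a 1,000-row sample misses the issue: {p_miss:.1%}")
print(f"Rows needed to keep that chance at or below 5%: {n_needed}")
```

Under these assumptions, a 1,000-row sample has roughly a 37 percent chance of containing no affected record at all, which is why a fixed sample limit matters more for rare issues than for common ones.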

A full data perspective, which provides the ability to work with all data records and column attributes in a given dataset, enables a comprehensive approach to data profiling and data quality. Full dataset visibility can have a significant impact on data accuracy and on how quickly reliable information is delivered, depending on the user’s understanding of the business context and the use case scenario.

Paradigm 4: Stand-alone Application vs. Vendor Add-on

Another often overlooked factor is whether the solution exists as a stand-alone offering or as part of a pre-existing BI or analytics application, data science tool, or ETL environment. Selecting a data preparation offering merely because it is available as an add-on to an in-house application has implications: the add-on may have limited capabilities to meet specific needs, and you must determine whether that risk is offset by the benefits of an integrated solution. In these cases, the comparison factors and considerations described above should be similarly applied.

Regardless of which data preparation paradigm makes sense for your organization, it is critical to understand the relative strengths and challenges of the four primary styles of data prep. Giving careful consideration to each strength and challenge in light of the use case scenarios, the data characteristics, and the individuals performing the data work will give you the highest chance of success.

Here’s a chart showing how the strengths and challenges of the different approaches stack up:

Data Prep Approach: Excel-based Data Prep

Most advantageous for finance & accounting; one-off scenarios for data profiling and data cleaning; and ad-hoc use for scenarios that fit within one million rows.

Strengths:
- Ubiquitous
- Highly customizable & editable
- Hundreds of functions (400+)

Challenges:
- Manual and error-prone
- Data limitations (size & formats)
- No collaboration or workflow reuse
- Lack of process transparency and data governance
- Proliferation of spreadsheet data in the organization

Data Prep Approach: Workflow-based Data Prep

The best option for enterprise data warehouse and data mart loading; advanced data mapping; application and business process integration; and incremental data and delta loading.

Strengths:
- Sophisticated data flow management
- Complex data transformations
- Support for high number of endpoints
- Advanced job scheduling for bulk workloads

Challenges:
- Requires in-depth knowledge of source and target systems
- Mainly batch-oriented (not interactive/ad hoc)
- Time and cost to build and deploy
- Skills and effort needed to learn

Data Prep Approach: SQL/Scripting-based Data Prep

Lends itself well to machine learning focused scenarios; advanced or highly customized business logic; and embedding data transformation logic/code into business applications.

Strengths:
- Flexibility
- Low cost
- Broad market skill availability

Challenges:
- Lack of reusability
- Resistant to scalability
- Lack of process transparency or built-in governance
- Difficult to manage or integrate multiple programming languages

Data Prep Approach: Interactive Self-service Data Prep

Best for overall data analytics and reporting; data profiling and modern data quality; data blending and integration; preparing data for machine learning; data governance; data lake exploration; and application or data consolidation and migration.

Strengths:
- Built-in algorithmic intelligence to detect joins, similarities, and anomalies
- Automatic versioning, step recording, and data lineage
- Business user-configured automation, scheduling, and monitoring
- Dataset and project reuse and collaboration

Challenges:
- No support for PDF/Word document parsing
- No unstructured data import (e.g. video, audio)

Organizations today continue to look for ways to prepare data more quickly and accurately to solve their data challenges and to enable machine learning. Data prep technology helps business analysts, data scientists, and ML practitioners rapidly prepare and annotate their data to extend the value of the data across the enterprise for analytic workloads. Regardless of which paradigms are currently in use at your organization, self-service data preparation solutions enable ML and data science workflows that enhance applications with machine intelligence. More importantly, they enable organizations to transform data into information on demand, empowering every person, process, and system to be more agile and intelligent.

About the author: Mike White is a Certified Business Intelligence Professional (CBIP) who is currently focused on developing data analyst-centric solutions and content at Paxata. In his former role as a Senior Solutions Consultant, he worked with several prospective customers and partners to demonstrate the value Paxata can provide to their organizations. Prior to joining Paxata in 2014, he worked for Ernst & Young as an enterprise analytics consultant where he managed teams focused on risk and compliance analytics and business performance reporting for clients in the high tech, media & entertainment, and healthcare industries. Mike has a Master’s in Management Information Systems from Brigham Young University.
