December 2, 2019

Best Practices for Wrangling Data on your Cloud Data Lake or Data Warehouse

To get the most value from your investment in a cloud data warehouse or data lake, your organization must break through the biggest bottleneck to analytics and AI: data wrangling. To do so successfully, organizations should:

Empower All Stakeholders

Your data preparation processes and solutions should empower all stakeholders to coordinate and do their jobs faster and more easily:

  • Data analysts, who need to explore, clean, blend, and aggregate data to improve time to value and open up new areas for insights.
  • Data scientists, who perform data exploration, analytics, modeling, and algorithm development on a wide variety of data sources and structures and collaborate with business leadership to determine analytical insights.
  • Data engineers, who design, build, and manage data processes and data architecture to support analytics and data science functions and need to automate data-related processes.

A data wrangling solution that offers self-service capabilities, combined with automation and orchestration to streamline data pipelines and provide centralized governance, can help all stakeholders make the best use of a cloud data lake or data warehouse.

Focus on the Right Use Cases

Traditional extract, transform, and load (ETL) grew up as a solution for standardizing transformation of data for carefully structured enterprise data warehouses. But when it comes to exploring, structuring, blending, and cleaning huge volumes of new, diverse, less-structured data, organizations need new alternatives for accelerating and automating these processes.

Focus on the areas where your data analysts and data scientists struggle to get beyond traditional reporting, querying, and visualization methods. In particular, look for use cases that:

  • Draw on less-structured data, such as IoT, application, and log data.
  • Involve a lot of manual preparation work in desktop tools or code-heavy environments.
  • Depend on IT teams to provision datasets for business teams, even though requirements change often and results are needed regularly.
  • Involve free-ranging data exploration that exceeds the capacity of standard SQL or ETL.

Ensure Data Quality at Scale with Continuous Validation

Cloud platforms often contain huge data volumes and a wide spectrum of data structures, everything from raw, semi-structured data to structured, transactional data from multiple systems. As such, cloud platforms open up a broader array of data to extract value from, which requires a more dynamic approach to data quality than traditional, rigid processes allow.

Your organization can improve the accuracy, consistency, and completeness of data in a cloud platform by using data wrangling solutions that combine a visual approach with machine learning to automate data cleaning procedures and provide insights into anomalies and data quality issues. Automation can handle the scale of cloud platforms and identify data values that appear to be incorrect, invalid, missing or mismatched.
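As a rough illustration of what such checks do, the following Python sketch sorts each value into one of the categories named above: missing, mismatched (wrong type or format), or invalid (well-formed but outside an allowed range). The column names, rules, and sample rows are hypothetical, and the rules here are hand-written rather than learned; it is not how any particular data wrangling product works.

```python
from collections import Counter
from datetime import datetime


def parse_date(text):
    """Parse an ISO-style date; raises ValueError on any other format."""
    return datetime.strptime(text, "%Y-%m-%d")


# Hypothetical rules: column -> (parser, validity check on the parsed value).
RULES = {
    "device_id": (str, lambda v: len(v) > 0),
    "temperature_c": (float, lambda v: -50.0 <= v <= 150.0),
    "recorded_on": (parse_date, lambda v: True),
}


def check_value(column, raw):
    """Return 'ok', 'missing', 'mismatched', or 'invalid' for one value."""
    if raw is None or str(raw).strip() == "":
        return "missing"
    parser, is_valid = RULES[column]
    try:
        value = parser(str(raw).strip())
    except ValueError:
        return "mismatched"  # value does not fit the expected type or format
    return "ok" if is_valid(value) else "invalid"


def profile(rows):
    """Count the outcome of every check, per column."""
    report = {column: Counter() for column in RULES}
    for row in rows:
        for column in RULES:
            report[column][check_value(column, row.get(column))] += 1
    return report


# Made-up sample records for illustration only.
SAMPLE_ROWS = [
    {"device_id": "a1", "temperature_c": "21.5", "recorded_on": "2019-12-02"},
    {"device_id": "a2", "temperature_c": "hot", "recorded_on": "2019-12-02"},
    {"device_id": "", "temperature_c": "999", "recorded_on": "12/02/2019"},
    {"device_id": "a4", "temperature_c": None, "recorded_on": "2019-12-03"},
]

if __name__ == "__main__":
    for column, counts in profile(SAMPLE_ROWS).items():
        print(column, dict(counts))
```

Running such a profile on every load, rather than once at design time, is one simple way to approximate the continuous validation described in this section.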

Automate Preparation of Data for Downstream Analytics and Machine Learning

Your cloud platform is where a vast and growing volume of data is collected from a huge number of sources, including Internet of Things (IoT) sensors, mobile devices, cameras, customer behavior, applications, and more. As the data generated by digital transformation explodes, so too does the opportunity to outcompete on differentiated, value-rich data.

Data wrangling routines should be scheduled, published, operationalized, and shared to reduce redundancies and ensure broad access to value-rich data. Your organization should consider automating data wrangling pipelines to:

  • Accelerate time to value
  • Reduce operational costs
  • Improve monitoring and governance

Centralizing the scheduling, publishing, and operationalizing of data wrangling routines results in less redundancy and inconsistency, more portability, and better management and governance.
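One way to picture a centralized, reusable routine is as an ordered list of small steps defined in a single place, with a row count recorded after each step for monitoring. The Python sketch below assumes hypothetical step names and record fields; scheduling and publishing are left to whatever scheduler or platform the organization already uses.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, object]
Step = Callable[[List[Row]], List[Row]]


def drop_incomplete(rows: List[Row]) -> List[Row]:
    """Remove records that contain any empty or missing field."""
    return [r for r in rows if all(v not in (None, "") for v in r.values())]


def normalize_device_id(rows: List[Row]) -> List[Row]:
    """Trim and upper-case the (hypothetical) device_id field."""
    return [{**r, "device_id": str(r["device_id"]).strip().upper()} for r in rows]


def run_pipeline(rows: List[Row], steps: List[Step]) -> Tuple[List[Row], List[Tuple[str, int]]]:
    """Apply each step in order and log how many rows remain after it."""
    log = []
    for step in steps:
        rows = step(rows)
        log.append((step.__name__, len(rows)))
    return rows, log


# The shared definition: every consumer runs the same steps in the same order.
PIPELINE: List[Step] = [drop_incomplete, normalize_device_id]

if __name__ == "__main__":
    raw = [
        {"device_id": " a1 ", "reading": 3},
        {"device_id": "a2", "reading": None},
    ]
    cleaned, log = run_pipeline(raw, PIPELINE)
    print(cleaned)
    print(log)
```

Keeping the step list in one shared definition means every team that runs it gets the same transformations, and the per-step log gives a simple audit trail.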

Ready to Learn More?

With seamless data wrangling across any cloud, hybrid or multi-cloud environment, Trifacta is the ideal data wrangling solution for your cloud platform. Try Trifacta for yourself today!

 

Datanami