
Analyze This: How to Prepare Your Unstructured Data in Six Steps

We live in a data-rich world where information is ours for the taking. But throwing just any data at your algorithm is a bad idea. With AI, small inconsistencies quickly become big ones. And those mistakes affect your decision-making, reputation, and bottom line. That’s why you need to prepare your data before you hand it over to your algorithms.
Here’s how to put quality data in—so you get quality data out.
Step 1: Clean Your Data
Junk data is part of life, especially with qualitative (text-based) data. Before you hit “upload” to your analytics platform, find and strip out low- or no-value data. You’ll improve your quality—and avoid wasting valuable processing credits.
Remove fields containing nothing, n/a, and gibberish. You can generally also remove very short text responses. Exceptions are when a question specifically asks for a very short response, or when users write something in a “further comments” box.
Sometimes you might want to add data instead, such as entering “N/A” into an empty cell. This lets the system process those data points instead of just skipping over blank text fields. Also check how your system handles special characters—some will skip text fields where a certain percentage of the content is special characters. Adjust those cells if you need to.
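As a rough sketch of this cleaning pass (pandas is assumed here, and the column name and example responses are invented):

```python
import pandas as pd

# Hypothetical survey export; the column name and responses are illustrative.
df = pd.DataFrame({"response": [
    "Great app, love the new interface",
    "n/a",
    "",
    "asdf",
    "Works well on my tablet",
]})

# Strip empty and no-value entries.
text = df["response"].str.strip()
cleaned = df[~text.str.lower().isin(["", "n/a", "na", "none"])]

# Drop very short responses (skip this filter when the question
# specifically asks for a one-word answer).
cleaned = cleaned[cleaned["response"].str.len() >= 10]
```

The length threshold is a judgment call; tune it per question rather than applying one cutoff everywhere.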
Step 2: Combine Like Data
Scraped or exported data often arrives in multiple files. You’ll get better results by combining your data into fewer files, or even just one. When deciding how to combine your data, consider:
- What you’re looking for
- If you’re comparing and contrasting, and if so what your main comparison point is
- If you’ll aggregate, then sort data
- If your data sources need to be separated, and if so whether you’ll build separate dashboards for each source
- How big the files are (NB, combining sources creates very large files that take longer to process)
For example, say you’re comparing reviews for “App A,” “App B,” and “App C” from both the Apple App Store and the Google Play Store. The review data arrives as six files: one for each app from each store.
You can combine this data in a few different ways. You could collate the Apple App Store data in one file and the Google Play Store data in another. Or save the reviews for each app across both stores into three separate files. Or combine everything into one large file.
Why one over the other? It depends on your goals. If you want to contrast the Apple App Store reviews with the Google Play Store reviews, then two files makes sense. If you’re comparing the apps themselves, then three files might make more sense. If you only need to process the data once, a single huge file is fine.
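The combination strategies above can be sketched like this (pandas is assumed; the small frames stand in for the six exported files):

```python
import pandas as pd

apps = ["App A", "App B", "App C"]
stores = ["Apple App Store", "Google Play Store"]

# Stand-ins for the six exported files (one per app per store).
frames = [
    pd.DataFrame({"app": [app], "store": [store],
                  "review": [f"A review of {app} on {store}"]})
    for app in apps for store in stores
]

# Option 1: one combined file per store, for contrasting the two stores.
by_store = {
    store: pd.concat([f for f in frames if f["store"].iloc[0] == store],
                     ignore_index=True)
    for store in stores
}

# Option 3: everything in one large file, if you only process it once.
all_reviews = pd.concat(frames, ignore_index=True)
```

The per-app split (option 2) follows the same pattern, grouping on the `app` column instead of `store`.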
Step 3: Add Metadata
Metadata provides information about your data. It helps you find, use, filter, sort and preserve your data. The more metadata, the better—just so long as it’s good quality.
Always add the essentials:
- Document source
- Date(s) created
- Date(s) scraped/pulled
- Author
You can also add:
- URLs
- Groupings
- Notes
- Names
- Locations
- Tags
- Other relevant info
You can upload as much or as little metadata as you want. But more metadata makes it easier to sort and filter your data.
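Attaching metadata can be as simple as adding columns (pandas is assumed; every value below is invented for illustration):

```python
import pandas as pd

reviews = pd.DataFrame({"review": ["Love the update", "Crashes on launch"]})

# The essentials (values here are made up for illustration).
reviews["source"] = "Apple App Store"            # document source
reviews["date_created"] = ["2022-04-01", "2022-04-03"]
reviews["date_scraped"] = "2022-05-01"
reviews["author"] = ["user123", "user456"]

# Optional extras: tags, groupings, URLs, and so on.
reviews["tags"] = [["ui"], ["bug", "crash"]]
```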
Step 4: Sort Your Metadata
Consistency matters. Properly format your metadata so that you can find and filter it in your system. Unformatted metadata just makes life harder for you. To get started:
- Check date formats
- Standardize formatting
- Fix misspellings or variations (Apple vs apple vs Apple Inc)
Uploading multiple documents to analyze, filter and graph? The formatting must be consistent across all the files so you can sort and compare them. In the App Store example above, you’d want the source field to always read “Apple App Store” or “Google Play Store” across every document.
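A minimal standardization pass might look like this (pandas is assumed; the inconsistent values are contrived):

```python
import pandas as pd

meta = pd.DataFrame({
    "source": ["apple app store", " Apple App Store", "google play store"],
    "date_created": ["04/01/2022", "2022-04-03", "April 5, 2022"],
})

# Standardize labels so filters match across every file.
meta["source"] = meta["source"].str.strip().str.title()

# Normalize mixed date formats to ISO, parsing element by element.
meta["date_created"] = (
    meta["date_created"].apply(pd.to_datetime).dt.strftime("%Y-%m-%d")
)
```

Fuzzier variations (“Apple” vs “Apple Inc”) usually need an explicit mapping table rather than a string method.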
Step 5: Save Your File
If you’re using Excel or Google Sheets to prepare your data, save two copies of your cleaned and prepared data: one in the native file format and one as a CSV. CSVs tend to upload faster. They’re also easy to open and edit on the fly, no matter what OS or software you use.
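If you script the export, the CSV copy can be sanity-checked with a quick round trip (pandas is assumed; paths are illustrative, and the native-format save is only noted in a comment because it needs an extra dependency):

```python
import tempfile
from pathlib import Path
import pandas as pd

df = pd.DataFrame({"review": ["Great app"], "source": ["Apple App Store"]})

out_dir = Path(tempfile.mkdtemp())

# CSV copy for uploading. In practice also keep a native copy,
# e.g. df.to_excel(out_dir / "reviews.xlsx"), which requires openpyxl.
csv_path = out_dir / "reviews.csv"
df.to_csv(csv_path, index=False)

# Confirm the CSV round-trips cleanly before uploading it.
round_trip = pd.read_csv(csv_path)
```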
Step 6: Tune Your Sample Sets
Planning to tune and reprocess your data multiple times? Create special tuning sets. These are small selections of your data used to “tune” your configurations. Because they’re smaller, your system can quickly process them. You’ll get timely feedback without burning through processing credits. Once you’ve tuned with your smaller sets, move on to a larger set to confirm that your results align with your expectations. Or repeat the process with a new tuning set.
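Carving out a tuning set can be a one-liner (pandas is assumed; a fixed random seed keeps the set reproducible between runs):

```python
import pandas as pd

# Stand-in for a full cleaned dataset of 1,000 rows.
reviews = pd.DataFrame({"review": [f"review {i}" for i in range(1000)]})

# Small, reproducible tuning set: fast feedback while you adjust configs.
tuning_set = reviews.sample(n=50, random_state=42)

# Larger confirmation set drawn from the remaining rows.
confirm_set = reviews.drop(tuning_set.index).sample(n=200, random_state=7)
```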
Getting your data in great shape before you feed it to an algorithm will net you better results and let you get more from your tech. Use the data prep best practices above, and you’ll spend less time fixing and more time analyzing.
About the author: Paul Barba is the chief scientist at Lexalytics, an InMoment company and a provider of analytic solutions for structured and unstructured data. Paul has 10 years of experience developing, architecting, researching and generally thinking about machine learning, text analytics and NLP software. He earned a degree in Computer Science and Mathematics from UMass Amherst.