October 30, 2012

Bashing Big Text Data with Simple PCs

Ian Armas Foster

Big data does not always come in easily analyzed numerical forms. Computers have almost always been better than humans at mathematics at any scale.

However, the job of determining whether a hotel review, for example, is positive or negative comes naturally to a person but less so to a computer. More and more small businesses need to be able to analyze millions of reviews in order to gain valuable customer feedback.

The problem is, not all companies have access to Hadoop clusters or supercomputers. For other companies, the return may not be worth the investment if the only goal is to garner customer sentiment. A paper written by Jan Zizka and Frantisek Darena of Mendel University in Brno, Czech Republic examines this problem. The question the paper asks is simple: how accurate can simple personal computers get relative to supercomputers or networks of computers with regard to text analytics?

Specifically, Zizka and Darena analyzed 2 million hotel reviews to find the words of high importance, such as location (which ended up always being important), helpful, and excellent, among others. They did this by building a matrix with the 2 million reviews on one axis and, on the other, the 200,000 English words used in those reviews. Each cell holds the number of times a particular word appears in a given review (often zero).
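To make the matrix concrete, here is a minimal sketch in Python. It is not the authors' code: the function name `build_term_document_matrix`, the whitespace tokenizer, and the two sample reviews are assumptions made for illustration. Because a dense 2 million by 200,000 matrix would hold 400 billion cells, most of them zero, the sketch stores only the nonzero counts for each review.

```python
from collections import Counter

def build_term_document_matrix(reviews):
    """Return (vocabulary, rows), where rows[i] maps word -> count for review i.

    Only nonzero counts are stored; a missing word counts as zero.
    Tokenization is a plain lowercase whitespace split (an assumption).
    """
    rows = [Counter(review.lower().split()) for review in reviews]
    vocabulary = sorted(set().union(*rows)) if rows else []
    return vocabulary, rows

# Hypothetical sample reviews, for illustration only.
reviews = [
    "excellent location and helpful staff",
    "location was fine but the room was noisy",
]
vocabulary, rows = build_term_document_matrix(reviews)

# Look up one cell: how often "was" appears in the second review.
was_count = rows[1]["was"]
```

Looking up a word that never appears in a review, such as `rows[0]["noisy"]`, returns zero without storing anything, which is what keeps the structure small.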

While useless to humans, that 2 million by 200,000 matrix helps the computer determine which words are used with frequency, which words are used in tandem, et cetera. With a little semantic help from programmers, the process of determining which keywords appear to hold importance becomes relatively straightforward.
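As a rough illustration of what "used with frequency" and "used in tandem" can mean in practice, the sketch below counts how often each word occurs and how many reviews contain each pair of words. The function name `frequency_and_cooccurrence` and the sample reviews are hypothetical, and the paper may well measure these things differently.

```python
from collections import Counter
from itertools import combinations

def frequency_and_cooccurrence(reviews):
    """Count word occurrences and, per review, each unordered pair of distinct words."""
    word_counts = Counter()
    pair_counts = Counter()
    for review in reviews:
        words = review.lower().split()
        word_counts.update(words)
        # Sorting makes each pair appear in one canonical order,
        # so ("helpful", "staff") and ("staff", "helpful") are not counted separately.
        pair_counts.update(combinations(sorted(set(words)), 2))
    return word_counts, pair_counts

# Hypothetical sample reviews, for illustration only.
reviews = [
    "friendly helpful staff",
    "helpful staff great location",
]
word_counts, pair_counts = frequency_and_cooccurrence(reviews)
```

Pair counting grows quickly with review length, so a real run over millions of reviews would likely restrict it to a short list of candidate keywords.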

However, it would be impossible for a personal computer to run through those 2 million reviews without crashing or taking several weeks. The solution is to take a sample of the reviews and split it into subsets that can run in parallel over a few personal computers. The authors took samples of 200,000, 100,000, and 20,000 reviews and split each into subsets that were a fraction of the sample's size. For example, the 200,000-review sample was split into subsets of 50,000, 40,000, 30,000, and 20,000 reviews.
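The sample-then-split step can be sketched as follows. This is an assumption-laden illustration rather than the paper's procedure: the function name `sample_and_split`, the use of uniform random sampling, and the fixed seed are all choices made here, and the sizes are scaled down by a factor of 1,000 so the example runs instantly.

```python
import random

def sample_and_split(items, sample_size, subset_size, seed=0):
    """Draw a random sample without replacement, then cut it into equal-size subsets."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    sample = rng.sample(items, sample_size)
    return [sample[i:i + subset_size] for i in range(0, sample_size, subset_size)]

# Stand-in for review identifiers: 2,000 instead of 2 million.
review_ids = list(range(2000))

# Scaled-down analogue of a 200,000-review sample cut into 50,000-review subsets.
subsets = sample_and_split(review_ids, sample_size=200, subset_size=50)
```

Each subset can then be handed to a separate machine or processed one after another on the same machine, since no subset depends on another.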

While not ideal, the result ended up being fairly accurate for high-importance words such as location, friendly, excellent, and helpful. The correlation between the value of those words in the subset and in the whole 2-million-review database neared 1, especially for the largest sample and subset combination.
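The article does not say which correlation measure the paper uses. Assuming a Pearson correlation over per-word importance scores, the comparison between a subset and the full dataset might look like the sketch below. The function names and the score values are invented for illustration and are not results from the paper.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def score_correlation(subset_scores, full_scores):
    """Correlate word scores over the words present in both dictionaries."""
    words = sorted(set(subset_scores) & set(full_scores))
    return pearson([subset_scores[w] for w in words],
                   [full_scores[w] for w in words])

# Made-up importance scores, NOT figures from the paper.
full_scores = {"location": 0.90, "excellent": 0.70, "helpful": 0.60, "friendly": 0.50}
subset_scores = {"location": 0.88, "excellent": 0.73, "helpful": 0.58, "friendly": 0.52}

r = score_correlation(subset_scores, full_scores)
```

A value of `r` close to 1 means the subset ranks and weights the keywords in nearly the same way as the full dataset does.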

Ultimately, the goal is to make the sample and its subsets as large as possible. This makes intuitive sense: the most accurate approach would be to run through all the data (impossible in this case), while the least accurate would be to use subsets of one review each. The combination of a 200,000-review sample with 50,000-review subsets ended up being the most fruitful, and it was manageable for a single personal computer to complete in a single day.

Beyond the actual analytics, the paper also went into interesting detail as to why garnering sentiment from a text sample is so difficult. For the most part, that difficulty stems from inconsistent user reports. For example, many hotel visitors who write in English do not natively speak English and end up misspelling words. While a simple spell-check is terrific for someone writing a 500-word article, it can produce some unwanted results when run through 38 million words over 2 million reviews.

Either way, this paper shows that running text analytics on large datasets may be accessible to small hotel businesses, not just to those with supercomputers.

Related Articles

Research Aims to Automate the Impossible

The Algorithmic Magic of Trendspotting

Spelunking Shops and Supercomputers

Tags: textual data
