October 30, 2012

Bashing Big Text Data with Simple PCs


Big data does not always come in easily analyzed numerical forms. Computers have almost always been better than humans at mathematics at any scale.

However, the job of determining whether a hotel review, for example, is positive or negative comes naturally to a person but less so to a computer. More and more small businesses need to be able to analyze millions of reviews in order to gain valuable customer feedback.

The problem is, not all companies have access to Hadoop clusters or supercomputers. For other companies, the return may not be worth the investment if the only goal is to garner customer sentiment. A paper written by Jan Zizka and Frantisek Darena of Mendel University in Brno, Czech Republic examines this problem. The question the paper asks is simple: how accurate can simple personal computers get relative to supercomputers or networks of computers with regard to text analytics?

Specifically, Zizka and Darena analyzed 2 million hotel reviews to find the words of high importance, such as location (which turned out to be always important), helpful, and excellent, among others. This was done by creating a matrix with the 2 million reviews on one axis and, on the other, the 200,000 English words used in those reviews. Each cell is then filled with the number of times a particular word appears in a given review (often zero).
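A review-by-word count matrix of this kind can be sketched in a few lines of Python. This is only an illustration, not the authors' code: the function names, the whitespace tokenization, and the two sample reviews are all assumptions made for the example. Each review is stored as a mapping of word to count, so the many zero cells are never written down.

```python
from collections import Counter

def build_term_counts(reviews):
    """Return (vocabulary, rows); rows[i] maps word -> count for review i.

    Hypothetical helper: splits on whitespace and lowercases, which is far
    cruder than any real preprocessing pipeline.
    """
    rows = [Counter(review.lower().split()) for review in reviews]
    vocabulary = sorted(set().union(*rows)) if rows else []
    return vocabulary, rows

def cell(rows, review_index, word):
    """Value of one matrix cell: how often `word` occurs in a given review."""
    return rows[review_index][word]  # Counter returns 0 for absent words

# Two made-up reviews, for illustration only.
reviews = [
    "excellent location and helpful staff",
    "location was excellent excellent",
]
vocabulary, rows = build_term_counts(reviews)
```

Here `cell(rows, 1, "excellent")` is 2, while `cell(rows, 0, "friendly")` is 0, the "often zero" case.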

While useless to humans, that 2 million by 200,000 matrix helps the computer determine which words are used frequently, which words are used in tandem, and so on. With a little semantic help from programmers, the process of determining which keywords appear to hold importance becomes relatively straightforward.
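A quick back-of-the-envelope calculation shows how large and how empty that matrix is. The review and word counts come from the article, as does the total of 38 million words quoted further down; the 4 bytes per cell is an assumption chosen only to give a sense of scale.

```python
# Figures quoted in the article.
num_reviews = 2_000_000
vocabulary_size = 200_000
total_words = 38_000_000  # total words across all reviews

total_cells = num_reviews * vocabulary_size

# Each word occurrence can make at most one cell non-zero, so this is an
# upper bound on the number of non-zero cells.
max_nonzero_cells = total_words
max_density = max_nonzero_cells / total_cells

# Assumption: 4 bytes per cell if the matrix were stored densely.
BYTES_PER_CELL = 4
dense_bytes = total_cells * BYTES_PER_CELL

print(f"cells: {total_cells:,}")
print(f"non-zero fraction (upper bound): {max_density:.6%}")
print(f"dense storage: {dense_bytes / 1e12:.1f} TB")
```

Under these assumptions the matrix has 400 billion cells, of which fewer than one in ten thousand can be non-zero, and a dense copy would need about 1.6 TB.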

However, a personal computer could not run through those 2 million reviews without crashing or taking several weeks. The solution is to take a chunk of the data and split it into subsets that can be run in parallel over a few personal computers. The researchers took samples of 200,000, 100,000, and 20,000 reviews and split each into subsets that were a fraction of the sample's size. For example, the 200,000 reviews were split into subsets of 50,000, 40,000, 30,000, and 20,000.
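The sample-then-split step might look roughly like the sketch below. It reads the listed figures as alternative subset sizes tried on the same sample, which is an interpretation rather than something the article states. The function names and the fixed random seed are likewise hypothetical.

```python
import random

def draw_sample(review_ids, sample_size, seed=0):
    """Pick `sample_size` reviews at random, without replacement."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    return rng.sample(review_ids, sample_size)

def split_into_subsets(sample, subset_size):
    """Cut a sample into consecutive subsets of at most `subset_size` items.

    The last subset is smaller when the sizes do not divide evenly.
    """
    return [sample[i:i + subset_size]
            for i in range(0, len(sample), subset_size)]

# Scaled-down illustration: 200 review ids stand in for 2 million.
all_ids = list(range(200))
sample = draw_sample(all_ids, 20)
subsets = split_into_subsets(sample, 5)
```

Each subset can then be handed to a separate machine, or processed one after another on the same machine.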

While not ideal, the result ended up being fairly accurate for high-importance words such as location, friendly, excellent, and helpful. The correlation between the value of those words in the subset and in the whole 2-million-review database neared 1, especially for the largest sample and subset combination.
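The article does not say which correlation measure was used or how a word's value was computed. As one plausible reading, the sketch below computes a Pearson correlation between two lists of per-word scores, one from a subset and one from the full collection. The `pearson` helper and the scores are invented for illustration and are not results from the paper.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length score lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up importance scores for the same four words, in the same order.
words = ["location", "friendly", "excellent", "helpful"]
subset_scores = [0.90, 0.55, 0.70, 0.60]
full_scores = [0.92, 0.50, 0.72, 0.61]

r = pearson(subset_scores, full_scores)
```

A value of `r` close to 1 means the subset ranks and weights the words much as the full collection does, which is the pattern the article describes.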

Ultimately, the goal is to make the dataset and the subsets as large as possible. This makes intuitive sense as the most accurate representation is to run through all the data (impossible in this case) while the least accurate would be to use subsets of one review each. The combination of a 200,000-review sample with 50,000-review subsets ended up being most fruitful and manageable for a single personal computer to accomplish in a single day.

Beyond the actual analytics, the paper also went into interesting detail as to why garnering sentiment from a text sample is so difficult. For the most part, that difficulty stems from inconsistent user reports. For example, many hotel visitors who write in English do not natively speak English and end up misspelling words. While a simple spell-check is terrific for someone writing a 500-word article, it can produce some unwanted results when run through 38 million words over 2 million reviews.

Either way, this paper shows that running text analytics on large datasets may be accessible to small hotel businesses, not just to those with supercomputers.
