
October 30, 2012

Bashing Big Text Data with Simple PCs


Big data does not always arrive in an easily analyzed numerical form. Computers have almost always been better than humans at mathematics, at any scale.

However, the job of determining whether a hotel review, for example, is positive or negative comes naturally to a person but less so to a computer. More and more small businesses need to be able to analyze millions of reviews in order to gain valuable customer feedback.

The problem is that not all companies have access to Hadoop clusters or supercomputers. For others, the return may not be worth the investment if the only goal is to gauge customer sentiment. A paper written by Jan Zizka and Frantisek Darena of Mendel University in Brno, Czech Republic, examines this problem. The question the paper asks is simple: how accurate can ordinary personal computers be, relative to supercomputers or networks of computers, when it comes to text analytics?

Specifically, Zizka and Darena analyzed 2 million hotel reviews to find the words of high importance, such as location (which turned out to be important in every case), helpful, and excellent, among others. They did this by building a matrix with the 2 million reviews on one axis and the 200,000 English words used to write those reviews on the other. Each cell is then filled with the number of times a particular word appears in a given review (often zero).
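The review-by-word matrix described above can be sketched in a few lines. This is only an illustration, not the authors' code: the function names, the tokenizing rule, and the two sample reviews are assumptions made for the example. Because most cells are zero, the sketch stores only the non-zero counts for each review.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase a review and split it into word tokens (assumed rule)."""
    return re.findall(r"[a-z']+", text.lower())


def build_term_matrix(reviews):
    """Return (vocabulary, rows), where rows[i] maps word -> count for review i.

    Zero cells are simply left out, which keeps the mostly empty matrix small.
    """
    rows = [Counter(tokenize(review)) for review in reviews]
    vocabulary = sorted(set().union(*rows)) if rows else []
    return vocabulary, rows


# Two made-up reviews, used purely for illustration.
reviews = [
    "Excellent location, helpful staff.",
    "Location was fine but the room was noisy.",
]
vocabulary, rows = build_term_matrix(reviews)
print(len(vocabulary))     # number of distinct words across both reviews
print(rows[1]["was"])      # how often "was" appears in the second review
print(rows[0]["noisy"])    # a word missing from a review reads as zero
```

A `Counter` returns zero for any word it has not seen, so a lookup behaves like reading a cell of the full matrix without ever storing the empty cells.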

While useless to humans, that 2 million by 200,000 matrix (400 billion cells, most of them zero) helps the computer determine which words are used frequently, which words are used together, and so on. With a little semantic help from programmers, the process of determining which keywords appear to hold importance becomes relatively straightforward.

However, it would be impossible for a personal computer to run through those 2 million reviews without crashing or taking several weeks. The solution is to take a sample of the full dataset and split it into subsets that can be run in parallel over a few personal computers. The authors took samples of 200,000, 100,000, and 20,000 reviews and split each sample into smaller subsets. For example, the 200,000-review sample was split into subsets of 50,000, 40,000, 30,000, and 20,000 reviews.
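The sample-then-split step might look roughly like the sketch below. It assumes one subset size is tried at a time and that the sample is drawn at random; the article specifies neither, and the function names are hypothetical. The numbers are scaled down by a factor of 1,000 so the example runs instantly.

```python
import random


def draw_sample(items, sample_size, seed=0):
    """Draw a random sample without replacement (random sampling is an assumption)."""
    rng = random.Random(seed)
    return rng.sample(items, sample_size)


def split_into_subsets(sample, subset_size):
    """Cut the sample into consecutive subsets of at most subset_size items."""
    return [sample[i:i + subset_size] for i in range(0, len(sample), subset_size)]


# Scaled-down stand-ins: 2,000 review IDs instead of 2 million,
# a sample of 200 instead of 200,000, subsets of 50 instead of 50,000.
all_review_ids = list(range(2000))
sample = draw_sample(all_review_ids, 200)
subsets = split_into_subsets(sample, 50)
print(len(subsets), [len(s) for s in subsets])
```

Each subset can then be handed to a separate machine or process, since no subset depends on the contents of another.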

While not ideal, the results ended up being fairly accurate for high-importance words such as location, friendly, excellent, and helpful. The correlation between the importance values of those words computed from the subsets and from the whole 2-million-review database neared 1, especially for the largest sample and subset combination.
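The comparison described above amounts to correlating two lists of per-word scores. The sketch below uses the Pearson correlation coefficient, which is an assumption: the article does not say which correlation measure the paper used or how word importance was scored. The score values are invented for illustration and are not results from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / math.sqrt(var_x * var_y)


def compare_scores(subset_scores, full_scores):
    """Correlate the scores of the words that appear in both dictionaries."""
    words = sorted(set(subset_scores) & set(full_scores))
    return pearson([subset_scores[w] for w in words],
                   [full_scores[w] for w in words])


# Hypothetical importance scores, made up for this example only.
subset_scores = {"location": 0.91, "friendly": 0.62, "excellent": 0.55, "helpful": 0.40}
full_scores = {"location": 0.95, "friendly": 0.60, "excellent": 0.58, "helpful": 0.37}
print(round(compare_scores(subset_scores, full_scores), 3))
```

A value near 1 means the subset ranks and weights the words almost the same way the full dataset does, which is the outcome the article reports for the larger samples.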

Ultimately, the goal is to make the sample and its subsets as large as possible. This makes intuitive sense: the most accurate approach would be to run through all the data at once (impossible in this case), while the least accurate would be to use subsets of one review each. The combination of a 200,000-review sample with 50,000-review subsets ended up being the most fruitful while remaining manageable for a single personal computer to complete in a single day.

Beyond the actual analytics, the paper also went into interesting detail as to why extracting sentiment from a text sample is so difficult. For the most part, that difficulty stems from inconsistent user writing. For example, many hotel visitors who write in English are not native English speakers and end up misspelling words. While a simple spell-check is terrific for someone writing a 500-word article, it can produce some unwanted results when run across 38 million words in 2 million reviews.

Either way, this paper shows that running text analytics on large datasets may be accessible to small hotel businesses, and not just to those with supercomputers.
