October 30, 2012

Bashing Big Text Data with Simple PCs

Ian Armas Foster

Big data does not always come in easily analyzed numerical forms. Computers have almost always been better than humans at mathematics at any scale.

However, the job of determining whether a hotel review, for example, is positive or negative comes naturally to a person but less so to a computer. More and more small businesses need to be able to analyze millions of reviews in order to gain valuable customer feedback.

The problem is, not all companies have access to Hadoop clusters or supercomputers. For other companies, the return may not be worth the investment if the only goal is to garner customer sentiment. A paper written by Jan Zizka and Frantisek Darena of Mendel University in Brno, Czech Republic examines this problem. The question the paper asks is simple: how accurate can simple personal computers get relative to supercomputers or networks of computers with regard to text analytics?

Specifically, Zizka and Darena analyzed 2 million hotel reviews to find the words of high importance, such as location (ended up being always important), helpful, and excellent among others. This was done by creating a matrix of the 2 million reviews on one axis and the 200,000 words of English used to create those reviews. Each cell is then filled with how many times a particular word appears in a given review (often zero).

While useless to humans, that 2 million by 200,000 matrix helps the computer determine which words are used with frequency, which words are used in tandem, et cetera. With a little semantic help from programmers, the process of determining which keywords appear to hold importance becomes relatively straightforward.

However, it would be impossible for a personal computer to run through those 2 million reviews without crashing or taking several weeks. The solution is to take a chunk of that and split into subsets to run in parallel over a few personal computers. They took samples of 200,000 reviews, 100,000 reviews and 20,000 reviews and split them into subsets of a fraction of those samples. For example, the 200,000 reviews were split into subsets of 50,000, 40,000, 30,000, and 20,000.

While not ideal, the result ended up being fairly accurate for high-importance words such as location, friendly, excellent, and helpful. The correlation between the value of those words from the subset and the whole 2 million review database neared 1, especially for the largest sample and subset combination.

Ultimately, the goal is to make the dataset and the subsets as large as possible. This makes intuitive sense as the most accurate representation is to run through all the data (impossible in this case) while the least accurate would be to use subsets of one review each. The combination of a 200,000-review sample with 50,000-review subsets ended up being most fruitful and manageable for a single personal computer to accomplish in a single day.

Beyond the actual analytics, the paper also went into interesting detail as to why garnering sentiment from a text sample is so difficult. For the most part, that difficulty stems from inconsistent user reports. For example, many hotel visitors who write in English do not natively speak English and end up misspelling words. While a simple spell-check is terrific for someone writing a 500-word article, it can produce some unwanted results when run through 38 million words over 2 million reviews.

Either way, this paper shows that running text analytics on large datasets may be accessible to small hotel business and not just for those with supercomputers.

Related Articles

Research Aims to Automate the Impossible

The Algorithmic Magic of Trendspotting

Spelunking Shops and Supercomputers