October 30, 2012

## Bashing Big Text Data with Simple PCs

Big data does not always come in easily analyzed numerical forms. Computers have almost always been better than humans at mathematics at any scale.

However, the job of determining whether a hotel review, for example, is positive or negative comes naturally to a person but less so to a computer. More and more small businesses need to be able to analyze millions of reviews in order to gain valuable customer feedback.

The problem is, not all companies have access to Hadoop clusters or supercomputers. For other companies, the return may not be worth the investment if the only goal is to garner customer sentiment. A paper written by Jan Zizka and Frantisek Darena of Mendel University in Brno, Czech Republic examines this problem. The question the paper asks is simple: how accurate can simple personal computers get relative to supercomputers or networks of computers with regard to text analytics?

Specifically, Zizka and Darena analyzed 2 million hotel reviews to find the words of high importance, such as "location" (which always turned out to be important), "helpful," and "excellent," among others. They did this by creating a matrix with the 2 million reviews on one axis and the 200,000 English words used in those reviews on the other. Each cell holds the number of times a particular word appears in a given review (often zero).
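To make the review-by-word matrix concrete, here is a minimal Python sketch. It is not the authors' code: the whitespace tokenization, the function names, and the three toy reviews are assumptions made for illustration. Because most cells are zero, each review is stored as a sparse row that records only the words it actually contains.

```python
from collections import Counter

def build_term_document_counts(reviews):
    """Return (vocabulary, rows); rows[i] maps a word's column index to its count in review i."""
    vocabulary = {}  # word -> column index
    rows = []        # one sparse row (a dict) per review
    for text in reviews:
        # Naive tokenization (lowercase, split on whitespace): an assumption, not the paper's method.
        counts = Counter(text.lower().split())
        row = {}
        for word, n in counts.items():
            col = vocabulary.setdefault(word, len(vocabulary))
            row[col] = n
        rows.append(row)
    return vocabulary, rows

def cell(vocabulary, rows, review_index, word):
    """Read one cell of the matrix; absent words count as zero."""
    col = vocabulary.get(word)
    return 0 if col is None else rows[review_index].get(col, 0)

# Toy reviews, invented for this example.
reviews = [
    "excellent location and helpful staff",
    "location was excellent excellent breakfast",
    "noisy room",
]
vocabulary, rows = build_term_document_counts(reviews)
print(cell(vocabulary, rows, 1, "excellent"))
print(cell(vocabulary, rows, 2, "location"))
```

Storing only the nonzero cells is what keeps a matrix of this shape within reach of ordinary memory; a dense 2 million by 200,000 array would not fit on a typical PC.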

While useless to humans, that 2 million by 200,000 matrix helps the computer determine which words are used frequently, which words are used together, and so on. With a little semantic help from programmers, the process of determining which keywords appear to hold importance becomes relatively straightforward.

However, it would be impossible for a personal computer to run through those 2 million reviews without crashing or taking several weeks. The solution is to take a sample of the data and split it into subsets that can be run in parallel over a few personal computers. The authors took samples of 200,000, 100,000, and 20,000 reviews and split each sample into smaller subsets. For example, the 200,000-review sample was split into subsets of 50,000, 40,000, 30,000, and 20,000 reviews.
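The sample-then-split step can be sketched as follows. This is an illustration under assumptions rather than the paper's procedure: it assumes a uniform random sample without replacement and disjoint, equal-size subsets, and the function name and seed are hypothetical. When the subset size does not divide the sample size evenly (for instance, 30,000 into 200,000), the last subset is simply smaller.

```python
import random

def sample_and_split(n_reviews, sample_size, subset_size, seed=0):
    """Draw a random sample of review indices and cut it into disjoint subsets."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    sample = rng.sample(range(n_reviews), sample_size)
    return [sample[i:i + subset_size] for i in range(0, sample_size, subset_size)]

# Sizes taken from the article: a 200,000-review sample in 50,000-review subsets.
subsets = sample_and_split(n_reviews=2_000_000, sample_size=200_000, subset_size=50_000)
print(len(subsets), [len(s) for s in subsets])
```

Each subset is a list of review indices, so it could be handed to a separate machine or processed one after another on a single PC.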

While not ideal, the result ended up being fairly accurate for high-importance words such as "location," "friendly," "excellent," and "helpful." The correlation between the importance values of those words computed from a subset and from the whole 2-million-review database neared 1, especially for the largest sample and subset combination.
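One way to read "correlation neared 1" is as a Pearson correlation between per-word scores from a subset and from the full database. The sketch below assumes that reading. The article does not say how the paper scores word importance, so the numbers here are made-up placeholders, not results from the paper.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical importance scores, invented for illustration only.
whole = {"location": 0.90, "friendly": 0.70, "excellent": 0.65, "helpful": 0.60}
subset = {"location": 0.88, "friendly": 0.73, "excellent": 0.61, "helpful": 0.62}

words = sorted(whole)  # same word order for both score lists
r = pearson([whole[w] for w in words], [subset[w] for w in words])
print(round(r, 3))
```

A value close to 1 means the subset ranks and weights the words almost the same way the full dataset does, which is the sense in which a small machine can stand in for a large one here.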

Ultimately, the goal is to make the sample and the subsets as large as possible. This makes intuitive sense: the most accurate approach would be to run through all the data (impossible in this case), while the least accurate would be to use subsets of one review each. The combination of a 200,000-review sample with 50,000-review subsets ended up being the most fruitful that a single personal computer could manage in a single day.

Beyond the actual analytics, the paper also went into interesting detail as to why extracting sentiment from a text sample is so difficult. For the most part, that difficulty stems from inconsistent user-written text. For example, many hotel visitors who write in English are not native English speakers and end up misspelling words. While a simple spell-check is terrific for someone writing a 500-word article, it can produce some unwanted results when run over 38 million words in 2 million reviews.

Either way, this paper shows that running text analytics on large datasets may be accessible to small hotel businesses, not just to those with supercomputers.
