November 12, 2012

Film Industry to Turn to Wikipedia for Predictive Analytics?

Might Wikipedia be a better indicator of sentiment than Twitter? In the movie industry, the answer might be yes. That is, as far as projected box office revenue is concerned.

According to a paper published by researchers Marton Mestyan, Taha Yasseri, and Janos Kertesz of the Institute of Physics at the Budapest University of Technology and Economics, data taken from the massive online encyclopedia serves as a better long-term predictor of box office revenue than the 140-character social media site.

Twitter is a gold mine of large-scale social information, invaluable to marketers looking to assess the effectiveness of their campaigns. However, in almost all Twitter use cases, some form of text analytics has to be applied, a proposition that is neither simple nor consistent. Often, the Twitter problem involves sifting through large volumes of tweets related to a certain company in order to determine public opinion, which is already quite data intensive.

Here the task involves determining which tweets are relevant to the movie, which removes some of the complexity of sentiment analysis but introduces the problem of determining relevance. On the other hand, someone visiting a movie’s Wikipedia page is either deliberately looking for information about that movie or, far less likely, landed there via the site’s random article function.

In the Twitter case, the text analysis has to determine relevance. For example, anyone mentioning “Star Trek” in a tweet may be referencing the movie due out next May. They may also be referencing any of the other eleven movies, five television shows, or the hundreds of books published under the same name. As a result, only a small percentage of that information will be useful. Meanwhile, someone going to the “Star Trek Into Darkness” page is more likely to go see the movie.

Despite those limitations, Twitter ends up being a decent predictor of box office revenue. However, according to the paper, while the Twitter model at its peak is a more accurate indicator than the Wikipedia model at its peak, its effectiveness is limited to opening night. At that point, the movie is already released and any potential marketing campaign adjustments would have a significantly lower chance of being effective.

To conduct this study, the researchers compiled the 312 movies released in America in 2010 that had Wikipedia pages. They measured four variables: number of views, number of human editors, number of human edits, and collaborative rigor (a function of edits per editor; a higher rigor value means the edits are spread out among more humans).
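As a rough illustration of how those four variables could be derived from a page's traffic and edit history, here is a minimal Python sketch. The `Edit` record, the `page_features` helper, and the input layout are hypothetical. The rigor calculation is also only a stand-in, since the exact formula is not given here: it counts a run of consecutive edits by the same editor once, so the value rises when edits alternate among more people.

```python
from dataclasses import dataclass


@dataclass
class Edit:
    editor: str   # user name of whoever made the edit (hypothetical field)
    is_bot: bool  # True for automated edits, which are excluded


def page_features(daily_views, edits):
    """Summarize one article's activity into four numbers.

    daily_views: list of page-view counts, one per day.
    edits: list of Edit records in chronological order.
    """
    human_edits = [e for e in edits if not e.is_bot]
    human_editors = {e.editor for e in human_edits}

    # Stand-in for "collaborative rigor" (an assumption, not the paper's
    # formula): consecutive edits by the same editor count only once.
    rigor = 0
    previous_editor = None
    for e in human_edits:
        if e.editor != previous_editor:
            rigor += 1
        previous_editor = e.editor

    return {
        "views": sum(daily_views),
        "editors": len(human_editors),
        "edits": len(human_edits),
        "rigor": rigor,
    }


history = [
    Edit("alice", False),
    Edit("alice", False),
    Edit("cleanup_bot", True),
    Edit("bob", False),
    Edit("alice", False),
]
print(page_features([10, 20, 30], history))
```

With this stand-in, one person making a hundred edits in a row scores 1, while a hundred edits alternating among several people score close to 100.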

The correlation was strong, hitting an R-squared value of 0.94 (an R-squared value of 1 would indicate a perfect fit) at its peak and, more impressively, 0.925 a month before the movie came out. While Twitter hit 0.98 at its peak, that peak arrived sharply on the release date, and the value hovered around 0 a month before release.
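For readers unfamiliar with the metric, R-squared (the coefficient of determination) compares a model's squared prediction errors with the spread of the actual values around their mean. The sketch below uses the standard definition, 1 minus the ratio of residual to total sum of squares. The function name and the sample numbers are illustrative only and are not taken from the paper.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")

    mean_actual = sum(actual) / len(actual)
    # Total variation of the observed values around their mean.
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    # Variation left unexplained by the predictions.
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))

    if ss_tot == 0:
        raise ValueError("actual values are all identical")
    return 1 - ss_res / ss_tot


# Made-up numbers for illustration only.
print(r_squared([1, 2, 3], [1, 2, 3]))  # predictions match exactly
print(r_squared([1, 2, 3], [2, 2, 2]))  # no better than guessing the mean
```

A value near 1 means the predictions track the actual figures closely, while a value near 0 means the model does no better than always predicting the average.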

This makes sense. As people get excited about a movie or actually go see it, they tend to tell Twitter about it. On the other hand, searching for and reading a Wikipedia article about an upcoming movie makes sense a month before that movie’s release, as it reflects a curiosity that the Twitterverse simply cannot satisfy. As a source of a great deal of important information, including cast members, spoilers, and updates, a particular movie’s Wikipedia page should attract a number of people roughly proportional to the number who will actually go see it.

As the release date approaches, the importance of page views increases. The researchers built their model to reflect a much more complex version of that statement. According to the researchers' graph of the results, the model ended up being fairly accurate, with the majority of (visible) misses underestimating the film’s box office return.

While the paper noted that some results of extremely unpopular movies did not show up on the logarithmic scale, outliers are to be expected when the data from 300+ movies is taken.
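To make the idea of predicting revenue from page activity concrete, here is a deliberately simplified Python sketch: an ordinary least-squares fit with a single predictor. It is not the researchers' model, which the article describes only as much more complex. The function name and the toy numbers are assumptions made for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of ys = slope * xs + intercept."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two paired observations")

    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("predictor values are all identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Toy values, NOT real data: think of xs as (log) page views and
# ys as (log) box office revenue for three imaginary films.
slope, intercept = fit_line([1.0, 2.0, 3.0], [3.0, 5.0, 7.0])
predicted = slope * 4.0 + intercept  # forecast for a fourth film
print(slope, intercept, predicted)
```

Working in logarithms is a common choice when quantities such as views and revenue span several orders of magnitude, which fits the logarithmic scale mentioned above.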

All in all, this is good news for marketing departments in the film industry. As stated before, compiling Twitter data to provide a decent sentiment analysis is tricky, and may not be worth it anyway. On the other hand, it is relatively easy to track the number of page views and edits a given Wikipedia entry has. The tough part is finding a baseline against which to compare, as the researchers obtained their data via Wikimedia Deutschland. That data may not be available to just anyone who asks for it.

It is also important to note that this metric measures the number of people expected to see the movie, not its perceived quality. For that, Twitter may end up being the stronger tool.

That said, movie marketers may look past Twitter to encyclopedic sites like Wikipedia and IMDb to estimate box office revenue a month in advance and adjust their strategy accordingly.
