November 12, 2012

Film Industry to Turn to Wikipedia for Predictive Analytics?

Ian Armas Foster

Might Wikipedia be a better indicator of sentiment than Twitter? In the movie industry, the answer might be yes. That is, as far as projected box office revenue is concerned.

According to a paper published by researchers Márton Mestyán, Taha Yasseri, and János Kertész of the Institute of Physics at the Budapest University of Technology and Economics, data taken from the massive online encyclopedia is a better long-term predictor of box office revenue than the 140-character social media site.

Twitter is a gold mine of large-scale social information, invaluable to marketers looking to assess the effectiveness of their campaigns. However, in almost all Twitter use cases, some form of text analytics has to be applied, a proposition which is neither simple nor consistent. Often, the Twitter problem involves sifting through large volumes of tweets related to a certain company in order to determine public opinion, which is already quite data intensive.

Here the task involves determining which tweets are relevant to the movie, which removes some of the complexity of sentiment analysis but introduces the problem of determining relevance. On the other hand, someone going to a movie’s Wikipedia page is either definitely looking for information about that movie or, far less likely, landed there via the site’s random article function.

With Twitter, the text analysis has to determine relevance. For example, anyone mentioning “Star Trek” in a tweet may be referencing the movie due out next May. They may also be referencing any of the other eleven movies, five television shows, or the hundreds of books published under the same name. As a result, only a small percentage of that information will be useful. Meanwhile, someone going to the “Star Trek Into Darkness” page is more likely to see the movie.

Despite those limitations, Twitter ends up being a decent predictor of box office revenue. However, according to the paper, while the Twitter model at its peak is a more accurate indicator than the Wikipedia model at its peak, its effectiveness is limited to opening night. At that point, the movie is already released and any potential marketing campaign adjustments would have a significantly lower chance of being effective.

To conduct this study, the researchers compiled the 312 movies released in America in 2010 that had Wikipedia pages. They measured four variables: number of views, number of human editors, number of human edits, and collaborative rigor (a function of edits per editor; a higher rigor value means the edits are spread out among more humans).
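To make the four variables concrete, here is a minimal Python sketch that derives them from a page's view counts and revision log. The record layout, the `is_bot` flag, and the `page_metrics` helper are all hypothetical, and the rigor calculation is only a stand-in (it counts a run of back-to-back edits by the same editor as one edit); it should not be read as the paper's exact definition.

```python
from itertools import groupby

def page_metrics(daily_views, revisions):
    """Summarize one article's activity.

    daily_views: list of daily view counts (assumed layout).
    revisions: chronological list of (editor, is_bot) tuples (assumed layout).
    """
    # Keep only edits made by humans, in their original order.
    human = [editor for editor, is_bot in revisions if not is_bot]
    views = sum(daily_views)
    editors = len(set(human))
    edits = len(human)
    # Stand-in for collaborative rigor: collapse back-to-back edits by
    # the same editor into one, so alternating editors score higher.
    rigor = sum(1 for _ in groupby(human))
    return {"views": views, "editors": editors, "edits": edits, "rigor": rigor}

# Made-up revision log for illustration only.
revisions = [("ann", False), ("ann", False), ("bot1", True),
             ("bob", False), ("ann", False)]
metrics = page_metrics([120, 340, 560], revisions)
print(metrics)
```

With this stand-in, a page edited by one person ten times in a row would score a rigor of 1, while ten edits alternating between two people would score 10.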

The correlation was strong: the Wikipedia model hit an R-squared value of 0.94 at its peak (an R-squared value of 1 would indicate a perfect fit) and, more impressively, 0.925 a month before the movie came out. While Twitter hit 0.98 at its peak, that peak arrives sharply on the release date, and the value hovers around 0 a month before release.
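For readers unfamiliar with the metric, the following Python sketch shows how an R-squared value (the coefficient of determination) is commonly computed from actual and predicted figures. The `r_squared` helper and the revenue numbers are made up for illustration and are not taken from the paper.

```python
def r_squared(actual, predicted):
    """Return 1 - (residual sum of squares / total sum of squares)."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# Made-up revenues, purely illustrative.
actual = [10.0, 20.0, 30.0, 40.0]

# Predictions that match exactly give a score of 1.
perfect = r_squared(actual, actual)

# Always predicting the average gives a score of 0.
flat = r_squared(actual, [25.0] * 4)

print(perfect, flat)
```

Seen this way, a value near 0 means the model does no better than guessing the average revenue for every film, while values above 0.9 mean most of the variation between films is accounted for.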

This makes sense. As people get excited about a movie or actually go see the movie, they tend to tell Twitter about it. On the other hand, searching for and reading a Wikipedia article about an upcoming movie makes sense a month before that movie’s release, as it reflects a curiosity that the Twitterverse simply cannot satisfy. As a source of a great deal of important information, including cast members, spoilers, and updates, a particular movie’s Wikipedia page should attract a number of people roughly proportional to the number that will actually go see it.

As the release date approaches, the importance of page views increases. The researchers built their model to reflect a much more complex version of that statement. According to the graph below, the model ended up being fairly accurate, with the majority of (visible) misses underestimating the film’s box office return.

While the paper noted that some results for extremely unpopular movies did not show up on the logarithmic scale, outliers are to be expected in data drawn from more than 300 movies.

All in all, this is good news for marketing departments in the film industry. As stated before, compiling Twitter data to provide a decent sentiment analysis is tricky, and may not be worth it anyway. On the other hand, it is relatively easy to track the number of page views and edits a certain Wikipedia entry has. The tough part is finding a baseline against which to compare, as the researchers obtained their data via Wikimedia Deutschland. That data may not be available to just anyone who asks for it.
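For page views specifically, one public source available today is the Wikimedia Pageviews REST API. It does not reach back to the 2010 releases in the study, so it is no substitute for the researchers' data set. The Python sketch below only builds a request URL for daily, human-only views of one article; the `pageviews_url` helper and the date range are illustrative choices, not anything described in the paper.

```python
from urllib.parse import quote

BASE = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"

def pageviews_url(title, start, end, project="en.wikipedia"):
    """Build a daily page-view URL; start and end are YYYYMMDD strings."""
    # Wikipedia titles use underscores in place of spaces.
    article = quote(title.replace(" ", "_"), safe="")
    # "user" asks for human traffic, excluding known bots and spiders.
    return f"{BASE}/{project}/all-access/user/{article}/daily/{start}/{end}"

url = pageviews_url("Star Trek Into Darkness", "20160101", "20160131")
print(url)
```

Fetching that URL returns JSON with one entry per day, which could then be summed or plotted against a film's release date.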

It is also important to note that this metric measures the number of people expected to see the movie, not the perceived quality. For that, Twitter may end up being the stronger tool.

That said, movie marketers may look past Twitter to encyclopedic sites like Wikipedia and IMDb to estimate box office revenue a month in advance and adjust their strategy accordingly.
