February 21, 2012

The Political Intrigue of Big Social Data

Nicole Hemsoth

Social media analytics are powering everything from big brand decisions to the monitoring of the cultural climate. Now, with the help of some key big data startups, the massive, constant rush of social data is being tapped to look for inaccuracies and trends in political races.

The value of social media in politics has gone largely unquestioned in recent years, as governments and individual candidates attempt to turn the social swell in their favor (or turn their masses away from the swell altogether). Companies like SocialMatica, however, are taking a different tack, applying the one-two punch of big social data and semantic data analysis to political battles.

Earlier this year the startup, which launched in 2010, released in-depth social data analysis figures that identified the top political topics in the GOP race using its semantic intelligence engine.

The company was able to pin down the key subjects influencing the race and, further, looked at these subjects as they related to specific candidates to get a real-time view into which issues were appealing to which potential voters.

Perhaps more important, however, is what the analytics revealed about media coverage of political races—and what these sources might be missing when it comes to the “big picture” of political engagement.

According to SocialMatica CEO Gary Hermansen, “What is most interesting here is the discrepancy between what traditional media outlets are touting as campaign hot topics and the actual topics of conversations taking place online. Despite the media hype and political buzz surrounding immigration, candidate income and even religion, online users are most concerned about topics closer to home – such as taxes and the economy.”

Hermansen continued, “I would argue that our newspapers and TV networks have continued to focus on topics less pertinent to the American public – such as Newt Gingrich’s previous work with mortgage giant Freddie Mac and the details of Mitt Romney’s personal income tax filings. These outlets would benefit from focusing on topics considered compelling by people actively engaging in social media – such as taxes, the economy, military and education.”

To get behind the scenes of the company on a technical level, we caught up with the company’s CTO, Mary Harris, for a few questions about how the platform works.

Please describe your semantic data intelligence engine; what technologies power this platform and how, if at all, on a functional level, is this different from the other platforms available?

We have developed a platform for converting freeform web data into a structured relational format and created a set of proprietary tools that are optimized for performing this task as well as creating highly structured databases of vertical/subject-based content. In the SocialMatica world, a “vertical” is a set of Web resources and appropriate associated vocabularies that have been defined in collaboration with the client.

This set of resources describes a representative data set for any particular subject, such as “Automotive” or “Data Storage” or, in this case, the “Republican Primary Candidates”. A vertical resource set is built and then fed to our Data Acquisition Controller system, which is responsible for data collection. All of our tools run in the cloud (on Rackspace) and are scaled to manage traffic and load. Included is a set of high-level diagrams to help explain the process and the tools. The three major components of our system are: A: the Data Acquisition Controller, B: the SocialBase Builder, and C: the Data Processing Controller.
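As an illustration only, a vertical as described here could be modelled as a named set of web resources plus an associated vocabulary. The Python sketch below is not SocialMatica's code: the class name `Vertical`, its fields, the placeholder URL, and the `matched_terms` helper are all hypothetical and chosen just to make the idea concrete.

```python
from dataclasses import dataclass, field
import re


@dataclass
class Vertical:
    """Hypothetical model of a 'vertical': web resources plus a vocabulary."""

    name: str
    resources: list = field(default_factory=list)   # seed URLs or API feeds
    vocabulary: set = field(default_factory=set)    # subject terms, lowercase

    def matched_terms(self, text: str) -> set:
        """Return the vocabulary terms that occur as whole words in text."""
        words = set(re.findall(r"[a-z']+", text.lower()))
        return self.vocabulary & words


# Made-up example data for the vertical named in the interview.
gop_primary = Vertical(
    name="Republican Primary Candidates",
    resources=["https://example.com/politics-blog"],  # placeholder URL
    vocabulary={"taxes", "economy", "immigration", "education"},
)

print(gop_primary.matched_terms("Voters keep asking about taxes and the economy."))
```

Keeping the vocabulary attached to the vertical, rather than in one global dictionary, mirrors the point made in the interview that data and vocabularies are vertical specific.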

The SocialMatica Production Pipeline

The diagram below describes the high-level flow and process of the entire system. We collect web material, both web pages and API feeds, and, using a three-step process, convert these data items into both numerical analytics and analyzed conversations that we then display in any number of product views. We collect information on both people and companies in any particular vertical, as well as the underlying and supporting blog/tweet/forum/LinkedIn/Facebook/news/YouTube/web page/etc. information, and build an integrated, searchable relational database.

The Data Acquisition Controller 

As the diagram below shows, this subsystem is responsible for collecting various types of web pages and managing access to several data APIs. This subsystem includes a web crawler and various types of page parsers. We do not rely on any RSS feeds to collect blog, forum, Facebook, or news information. To support this functionality, we have developed a very sophisticated proprietary page scraper/parser which allows us to scrape and parse any text off of any type of page or resource.
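The proprietary scraper/parser itself is not described in any detail, so the following is only a minimal sketch of the general idea of pulling visible text off a page, written with Python's standard `html.parser` module. The `TextExtractor` class, the `extract_text` function, and the sample page are assumptions made for illustration; a production scraper would also need crawling, encoding detection, and per-site handling.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # > 0 while inside a skipped element
        self.chunks = []       # non-empty text fragments, in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html: str) -> str:
    """Return the page's visible text as one space-separated string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Made-up sample page.
sample = (
    "<html><head><style>p{color:red}</style></head><body>"
    "<h1>Taxes</h1><p>The economy leads the debate.</p>"
    "<script>var x = 1;</script></body></html>"
)
print(extract_text(sample))  # prints: Taxes The economy leads the debate.
```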

The SocialBase Builder

The original goal of this sub-system was to build more complete data sets for our verticals. We realized early on in the process that we could build reasonably complete data records for the people and companies that we were tracking using only web-based data.

The diagram below highlights this subsystem’s function and basis:

The Data Processing Controller

This controller manages various text processing and analytic procedures which are combined to create our unified SocialMatica database.

Again, see below for a visual representation:

To answer your “how is this different” question, we have created and optimized the process and the tools to create verticals, not large general knowledge bases. As previously mentioned, all data and vocabularies are vertical specific, and all tools are optimized for this process.

How is this “semantic data intelligence” being leveraged to follow the GOP race? Please be technical in your answer—no generalities; we are looking for an answer on the algorithmic/hardware/middleware-framework level here.

We understand the influence of the people in any vertical by understanding who they talk about, who talks about them, and what topics they are talking about. We used our general vertical framework, which was designed to evaluate people and companies in a space, and modified it slightly to look at candidates in place of companies.
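One way to picture the "who talks about whom" relationships in this answer is as a directed mention graph. The sketch below is an assumption about how such data might be indexed, not a description of SocialMatica's system; the names `mentions`, `talks_about`, `talked_about_by`, and `topics_by_author`, along with the sample records, are invented for the example.

```python
from collections import defaultdict

# (author, person mentioned, topic) records; made-up sample data.
mentions = [
    ("blogger_a", "candidate_x", "taxes"),
    ("blogger_b", "candidate_x", "economy"),
    ("blogger_a", "candidate_y", "taxes"),
    ("blogger_c", "blogger_a", "education"),
]

talks_about = defaultdict(set)       # author -> people they mention
talked_about_by = defaultdict(set)   # person -> authors who mention them
topics_by_author = defaultdict(set)  # author -> topics they write about

for author, target, topic in mentions:
    talks_about[author].add(target)
    talked_about_by[target].add(author)
    topics_by_author[author].add(topic)

print(sorted(talked_about_by["candidate_x"]))  # prints: ['blogger_a', 'blogger_b']
```

Indexing the same records in both directions makes each of the three questions in the answer a single dictionary lookup.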

We rank candidates and influencers in the space. We rank these candidates by the number of influencers that are talking about them and the rank, or importance, of these influencers. We rank influencers by how much they write, how much they are commented on, how often they are mentioned, the importance of the publishing site, the number of their Twitter followers and tweets, the number of Twitter mentions and retweets, the relevancy of their tweets and blog posts to this vertical, and the timeliness of their posts.

In addition to these attributes we can also use other attributes such as number of social connections, education level, etc. to modify the ranking. These attributes are weighted based on our analysis of the social space and what we believe is most important.
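The interview does not disclose the actual weights or formula, so the following is only a sketch of what a weighted attribute ranking of this kind might look like. The weights in `WEIGHTS`, the four attributes chosen, the max-based normalisation, the sample figures, and the choice to score a candidate as the sum of the scores of the influencers mentioning them are all assumptions made for illustration.

```python
# Hypothetical attribute weights; the real ones are not public.
WEIGHTS = {"posts": 0.2, "comments": 0.2, "mentions": 0.3, "followers": 0.3}

# Made-up influencer statistics.
influencers = {
    "blogger_a": {"posts": 40, "comments": 120, "mentions": 30, "followers": 5000},
    "blogger_b": {"posts": 10, "comments": 20, "mentions": 5, "followers": 800},
    "blogger_c": {"posts": 25, "comments": 60, "mentions": 15, "followers": 10000},
}

# Made-up record of which influencers talk about which candidate.
talked_about_by = {
    "candidate_x": {"blogger_a", "blogger_b"},
    "candidate_y": {"blogger_c"},
}


def influencer_scores(people, weights):
    """Weighted sum of attributes, each scaled to [0, 1] by its maximum."""
    scores = {name: 0.0 for name in people}
    for attr, weight in weights.items():
        top = max(stats[attr] for stats in people.values()) or 1
        for name, stats in people.items():
            scores[name] += weight * stats[attr] / top
    return scores


def candidate_scores(mentioned_by, scores):
    """Sum the scores of every influencer talking about each candidate."""
    return {
        candidate: sum(scores.get(author, 0.0) for author in authors)
        for candidate, authors in mentioned_by.items()
    }


inf_scores = influencer_scores(influencers, WEIGHTS)
cand_scores = candidate_scores(talked_about_by, inf_scores)
ranking = sorted(cand_scores, key=cand_scores.get, reverse=True)
print(ranking)
```

Summing influencer scores rewards both the number of influencers discussing a candidate and their individual importance, which are the two factors named in the answer; other combinations, such as averaging, would weight those two factors differently.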

Does your company work with any enterprise customers in a capacity outside of marketing analytics? In other words, is your platform being leveraged for any mission-critical operations outside of marketing, say for more direct BI or other purposes?

Yes, we are currently using our platform in support of research and business intelligence.