How Data Analytics is Shifting Politics
Now that the U.S. presidential campaign is over, the Democratic National Committee is starting to reveal some of the factors that led to their tech advantage.
DNC Director of Architecture Chris Wegrzyn and HP’s Chris Selland recently took an in-depth look at the data operations that were steadily crunching away behind the scenes well before the polls closed on this year’s election.
According to Obama Campaign Manager Jim Messina, their campaign’s data analytics operation was one of their strongest advantages over the Romney campaign. We touched on that in a previous article, noting how big data analytics sparked efficient resource allocation, particularly when it came to volunteer placement and advertisement purchasing.
Unlike companies, which can plan their big data operations around long-term viability and profitability, the presidential campaign had to get its systems fully operational quickly. Wegrzyn noted that they had to figure out how to both raise and spend a billion dollars in the most efficient manner over the course of 18 months. A company that invests in a Hadoop cluster or something like HP Vertica, by contrast, could spend months running research tests before fully deploying it at enterprise scale.
Options such as Hadoop were attractive but ultimately not what they were looking for. “We used [Hadoop], we loved it, but it wasn’t going to be this central analyst platform,” Wegrzyn said in describing the process behind selecting a data management system. The DNC wanted to play to a strength: the campaign had a decent pool of ‘smart people’ from which to draw. As such, they needed a system that could be quickly and easily learned.
Wegrzyn then turned to SQL. “SQL databases had a simple model that people already knew or we could teach people easily, it was designed for performance to minimize tinkering for speed and it had a clear scalability path.”
Next was identifying a vendor and a system to manage their SQL databases. Again, the specific needs and resources of their campaign drove their decision. Since they needed quick decisions from datasets compiled over a relatively short period of time, performance was going to be a bigger issue than data storage. As such, the appliance cost models some vendors were offering were a poor fit. “For us, vendors that were offering appliance cost models didn’t really make a lot of sense for us. We felt like we were going to need performance before we were going to need storage.”
Vertica ended up doing well in the campaign’s proof-of-concept models, but the tipping point lay in a shared vision between the two organizations. “The one that really tipped the scale was that Vertica had this roadmap that we felt was aligned with this idea of an analyst-driven organization.”
Through Vertica, the campaign was unexpectedly able to connect their digital and field operations. This added bonus, which turned into a system they would call ‘AirWolf,’ came about as a result of connecting all their databases to the Vertica framework. After the 2008 campaign, it was thought that those databases were perhaps too vast to connect. But through a tool they developed called ‘Stork,’ the analysts were able to combine the databases. “We built a tool we called Stork which basically let Vertica serve as the center for not just our analyst operations but for how we interacted with the entire campaign.”
This comprehensive integration was built atop Vertica by the analysts and engineers using it. According to Wegrzyn, the system’s relative simplicity made that possible. “We started with just raw data and we wrote a system that allowed us to move data from various different databases into our Vertica system on a regular basis… We built a platform on top of Vertica that in short was a glorified SQL runner and scheduler.”
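To make the “glorified SQL runner and scheduler” idea concrete, here is a minimal sketch of what such a tool might look like. It is not the campaign’s actual code: the class names, table names and one-hour interval are hypothetical, and Python’s built-in sqlite3 module stands in for Vertica so the example can run anywhere.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class SqlJob:
    """A named SQL script that should run every `interval_s` seconds."""
    name: str
    sql: str
    interval_s: int
    next_run: float = 0.0  # timestamp at which the job is next due


class SqlRunner:
    """Runs whichever registered jobs are due at a given time."""

    def __init__(self, conn):
        self.conn = conn
        self.jobs = []

    def add(self, job):
        self.jobs.append(job)

    def run_due(self, now):
        """Execute every job whose next_run has passed; return their names."""
        ran = []
        for job in self.jobs:
            if job.next_run <= now:
                self.conn.executescript(job.sql)
                self.conn.commit()
                job.next_run = now + job.interval_s
                ran.append(job.name)
        return ran


def demo():
    # sqlite3 in memory is an assumption standing in for the real warehouse.
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE source_signups (email TEXT, zip TEXT);
        CREATE TABLE warehouse_signups (email TEXT, zip TEXT);
        INSERT INTO source_signups VALUES ('a@example.org', '60601');
    """)
    runner = SqlRunner(conn)
    # Hypothetical job: refresh the warehouse copy of a source table hourly.
    runner.add(SqlJob(
        name="load_signups",
        sql="DELETE FROM warehouse_signups;"
            "INSERT INTO warehouse_signups SELECT email, zip FROM source_signups;",
        interval_s=3600,
    ))
    first = runner.run_due(now=0)      # job is due, so it runs
    second = runner.run_due(now=10)    # too soon, nothing runs
    third = runner.run_due(now=3600)   # an hour later, it runs again
    rows = conn.execute("SELECT COUNT(*) FROM warehouse_signups").fetchone()[0]
    return first, second, third, rows
```

In a real deployment the caller would invoke run_due on a timer with the current time. Passing the time in explicitly, as here, simply keeps the sketch easy to test.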
Once the databases were integrated, analysts were able to alert field organizers about online registrants in their respective areas, allowing the organizers to cast a net targeting those likely to volunteer.
The philosophy of the campaign, according to Wegrzyn, was that nothing was to be assumed. Assumptions generally made in campaigns based on “what makes sense” were to be eschewed in favor of data-driven conclusions. As noted in the aforementioned article, the campaign would make unusual ad buys (such as The Walking Dead) to target specific, niche voter groups.
To figure out which programs to buy, they would pull together in Vertica the data they had collected from unspecified vendors. That data was heavily demographic in nature, going beyond “women ages 20-29,” as Wegrzyn put it. For example, they wanted to target young voters who were likely supportive of the president but not necessarily driven to vote. They were then able to match that with pricing models and make informed, usually cheaper cable buys that reached the younger, arguably more apathetic populace.
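The matching of audience data against pricing can be illustrated with a small sketch. The article does not describe the campaign’s actual method, so everything here is an assumption: the slot names and figures are invented for illustration, and the greedy “cheapest cost per targeted viewer first” rule is just one simple heuristic, not a claim about how the buys were really chosen.

```python
from dataclasses import dataclass


@dataclass
class AdSlot:
    """A purchasable ad slot (all fields are illustrative)."""
    program: str
    price: float          # cost of the slot in dollars
    target_viewers: int   # estimated viewers in the targeted group


def cost_per_target(slot):
    """Dollars spent per viewer in the targeted group."""
    return slot.price / slot.target_viewers


def pick_slots(slots, budget):
    """Greedily buy the slots with the lowest cost per targeted viewer."""
    # Slots that reach nobody in the target group are never worth buying.
    useful = [s for s in slots if s.target_viewers > 0]
    chosen = []
    remaining = budget
    for slot in sorted(useful, key=cost_per_target):
        if slot.price <= remaining:
            chosen.append(slot)
            remaining -= slot.price
    return chosen
```

With such a rule, an inexpensive niche cable slot can outrank a costly prime-time slot even though it reaches fewer people overall, because more of the money lands on the voters the campaign wants to reach.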
As a result, they turned part of their ad campaign into an extended get-out-the-vote movement.
Again, keep in mind that this type of platform was useful to people who required simplicity and the ability to run analytics with as little development time as possible. Storage was not much of an issue for a system that was only to be in use for a year. From a short-term perspective, Vertica did well for the campaign, and will likely at least serve as a model for 2016.