Big Data: Beneficial or Bunk?
If there’s one guy who truly recognizes the power of big data, you’d think it would be the CTO of President Obama’s 2012 re-election campaign, whose expert application of big data technologies is widely credited with helping his boss keep his job. So when Harper Reed recently called big data “bul***it,” it raised a few eyebrows.
Harper Reed was the CTO of President Obama’s 2012 re-election campaign.
Just before Halloween, Reed surprised his audience at a State University of New York conference with a few choice words about the big data phenomenon. “Big Data is bul***it,” Reed said during a keynote speech, as reported by The Chronicle of Higher Education. “The ‘big’ there is purely marketing,” he continued. “This is all fear … This is about you buying big expensive servers and whatnot.”
Reed continued with his “intervention” for the big-data-loving crowd assembled before him in Lower Manhattan. “The exciting thing is you can get a lot of this stuff done just in Excel,” he said, according to The Chronicle. “You don’t need these big platforms. You don’t need all this big fancy stuff. If anyone says ‘big’ in front of it, you should look at them very skeptically … You can tell charlatans when they say ‘big’ in front of everything.”
Nobody is claiming that Excel was the campaign’s secret re-election weapon. In fact, the team’s feats of analytical wonder–enabled by a $1-billion budget and an IT staff of 165–have been well documented, including right here at Datanami. Campaign manager Jim Messina is on record explaining how the IT team built a model to analyze the political proclivities of every American voter, and ran 62,000 simulations over the course of the campaign to see how things might play out. You simply don’t do that with Excel running on a Dell Inspiron.
But Reed was making a valid point about the level of hype and marketing in the big data space. There is an unhealthy fascination with new technologies like Hadoop and the NoSQL databases, which can be quite useful in some cases, but which are clearly overkill in many situations.
“I enjoyed hearing my friend Harper claim that ‘Big Data is Bul***t,’” writes Lukas Biewald, the CEO of Crowdflower, in a recent blog post. “I remember meeting with him as he worked with me and other Silicon Valley tech CEOs asking for techniques to deal with the large databases of voter and donors the Obama campaign was dealing with. Even then I thought that many of the proposed solutions were overkill for the actual size of the data sets he was dealing with. It’s sexier to advise someone to use MongoDB or Hadoop than to tell them their problem can be solved with a few lines of Python. So I’m not surprised that he feels like ‘Big Data’ is a bunch of BS.”
Vivek Ranadivé, the CEO of TIBCO, warns against falling for big data marketing hype. “I would agree partially with him [Reed] that there are a lot of 20th century companies and they would have you believe that big data means big database,” Ranadivé told Datanami this week. “He’s absolutely correct that they want you to buy big machines to house big databases. But that’s basically the old way of thinking. Companies like Oracle and IBM would have you think that.”
But Ranadivé disagrees with Reed when it comes to the validity of big data being a real phenomenon. “So you really have to be careful that you don’t confuse big data with big database and big hardware,” he says. “Because there are two types of big data. There’s data at rest, about what happened historically, and there’s data in motion, which is real-time data. True insights come from extracting really the right pieces and bringing them together.”
TIBCO’s big data initiative is focused on extracting useful information from big data at rest, and then enabling companies to act on real-time signals. “So whether you’re trying to cure cancer or you’re trying to figure out who to sell hot dogs at a basketball game and at what price, or you’re trying to figure out what combinations of players will win you the game or you’re trying to find out when a customer is about to become unhappy or some kind of security threat or even that a plane is about to go down–those are all big data problems,” Ranadivé says.
Lonne Jaffe, the CEO of Syncsort, is another insightful software executive who’s skeptical about the marketing hype around big data, but who is nevertheless a true believer in the power of innovative technology.
“It’s a little hard to differentiate these days between people who are ‘big-data washing’ their products versus the real industrial-grade systems that do something useful,” Jaffe tells Datanami. “Especially because there’s a lot of nuance and subtlety to the technology that allows disruption, particularly with people who don’t understand what the products actually do.”
Hadoop is a hugely disruptive technology at the moment, and it’s hard for chief marketing officers at software vendors not to jump on the bandwagon. “Sometimes they just add Hadoop to whatever they have existing, and they try to make their products appear more valuable,” Jaffe says. “I think that won’t work longer term. But the companies that are doing the real Hadoop work–they’re generating a lot of value.”
It’s really kind of sad when you see an older IT firm with big legacy franchises to protect professing its allegiance to the new big data god. “There’s a lack of conviction in their investment,” Jaffe says. “There’s no level of success they could have in a Hadoop-based business that would ever come close to offsetting even a slightly accelerated decline in their existing franchise.”
These are interesting times in the $3-trillion IT industry, to be sure. So what is an IT manager at a $100 million manufacturer, or a $1 billion retailer, or a $10 billion insurance company, supposed to do? Biewald provides some good advice. The Crowdflower CEO breaks big data down into three levels of data.
- Level 1: Under 20,000 rows, or Data Sets That Can Be Opened In Excel.
“Everyone loves to complain about Excel, but it’s a great tool,” Biewald says. “The graphs are a little ugly and the statistical tests are a little strange, but being able to see and manipulate your data in an unstructured way is amazing.”
- Level 2: Under 2 million rows, or Data Sets That Fit Into RAM on a Single Machine
“If your data fits into memory on a single machine, there’s no need to use new-fangled tools like NoSQL databases or Hadoop,” he says. “At this scale, a reasonably configured MySQL or Postgres database is going to be just fine and make your life a lot easier. Remember that the 2,000,000 or so row limit of this stage goes up every year as computers get more and more RAM, so even if your quantity of data is doubling every 18 months, you may be able to stay at this stage forever, and if you can avoid going to level 3, you will be glad you did!”
- Level 3: Above 2 million rows. A World of Pain
“Once your data doesn’t comfortably fit into memory on a single machine, things are going to get much harder,” he says. “This is where the ‘Big Data’ tools start to make sense and where you want to hire specialized people to set them up and run them. Having run a Hadoop cluster in the early days of Powerset and dealt with data sets of this type at Stanford and Yahoo, I try to stay away from this size at all costs and I try to get my data out of this stage as fast as I can. Everything is trickier here, it’s hard to compute averages and look at what kinds of outliers you might have, and it’s easy to make dumb mistakes that would be obvious at smaller scales.”
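For readers who want to apply Biewald's rule of thumb, the three levels can be sketched as a short Python function. This is only an illustration under stated assumptions: the row thresholds are the ones quoted above, the names `data_level` and `suggested_tooling` are hypothetical, and treating a data set of exactly 2 million rows as level 3 is a guess, since the article does not say which side of the line it falls on. As Biewald notes, the level 2 ceiling rises as machines get more RAM, so the constants are a snapshot and not a fixed rule.

```python
# Row-count thresholds as quoted from Biewald above. They are rough and
# drift upward over time as single machines gain more RAM.
EXCEL_ROW_LIMIT = 20_000
SINGLE_MACHINE_ROW_LIMIT = 2_000_000


def data_level(row_count: int) -> int:
    """Return 1, 2, or 3 for the level a data set of this size falls into.

    Assumption: a data set sitting exactly on a threshold is placed in the
    higher level, because the article leaves the boundary cases undefined.
    """
    if row_count < 0:
        raise ValueError("row_count must be non-negative")
    if row_count < EXCEL_ROW_LIMIT:
        return 1
    if row_count < SINGLE_MACHINE_ROW_LIMIT:
        return 2
    return 3


def suggested_tooling(row_count: int) -> str:
    """Map a row count to the kind of tool Biewald recommends at that level."""
    suggestions = {
        1: "a spreadsheet such as Excel",
        2: "a single-machine database such as MySQL or Postgres",
        3: "distributed 'big data' tools such as Hadoop, run by specialists",
    }
    return suggestions[data_level(row_count)]


if __name__ == "__main__":
    for rows in (5_000, 500_000, 50_000_000):
        print(f"{rows:>12,} rows -> level {data_level(rows)}: {suggested_tooling(rows)}")
```

Row count is only a proxy here. Biewald's real test for level 2 is whether the data fits comfortably in memory on one machine, so a table with very wide rows may leave that level well before it reaches 2 million rows.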
Everybody seems to be marketing some kind of big data solution these days, which definitely dilutes the big data pool. But throwing out the big data baby with the bathwater would be a mistake. Big data (or whatever you want to call it) is the marketing term du jour for sure, but it’s also a useful label for a new class of data processing problems and the solutions that are being created to deal with them. Don’t lose sight of that.