The 2013 Big Data Year in Review
While still in its infancy, the big data technology trend has made a lot of substantial progress since it gained traction at the beginning of this decade. The year 2013 was a big year with advances being made in virtually every quarter of the space. In this feature, we take a look at some of the significant trends that have crossed our desks in the past year — wrapped up and presented to you with a pretty bow. Out with the old, in with the new — it’s the Datanami 2013 Year in Review!
Hadoop 2: Beginning a New Era in Big Data
When Hadoop started catching on several years ago, nearly everything was done through the batch-oriented MapReduce framework. That was all well and good when you had lots of time to get an answer to your question. But if you needed an answer in a short amount of time, or had a question that didn’t map well to MapReduce, then Hadoop just wasn’t the right place to be asking those questions.
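The batch-oriented model described above can be illustrated with the canonical MapReduce example, word count. This is a minimal Python sketch of the programming model only, not actual Hadoop API code:

```python
from collections import defaultdict

def map_phase(records):
    """Mapper: emit a (word, 1) pair for every word in every input line."""
    for line in records:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    """Reducer: sum the counts for each key.
    In real Hadoop, a shuffle step groups pairs by key across the
    cluster before reducers run; here a dict stands in for that."""
    grouped = defaultdict(int)
    for key, value in pairs:
        grouped[key] += value
    return dict(grouped)

lines = ["big data is big", "data is data"]
counts = reduce_phase(map_phase(lines))
print(counts)  # {'big': 2, 'data': 3, 'is': 2}
```

The whole job runs start to finish over the full dataset before any answer appears, which is exactly why the model works for overnight reports but not for interactive queries.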
The characterization of Hadoop as a one-trick pony changed in October, when the Apache Software Foundation announced the general availability of Hadoop version 2. The new version was such a big deal because it formally introduced YARN, which made it much easier for users to run other processing engines – such as HBase, SQL, graph analytics, and stream processing – on their Hadoop clusters right beside their traditional MapReduce workloads.
“With the YARN scheduler in there, we can now better intermix different kinds of workloads and better utilize resources when these loads are inter-mixed,” Doug Cutting, the creator of Hadoop and chief architect at Cloudera, told Datanami in October. “So we’ll be able to get graph processing in here along with SQL and other things, and have them effectively sharing memory, CPU, disk, and network I/O more effectively.”
Hortonworks, which sticks to the open source version of Hadoop, was the first Hadoop software company to ship a distribution based on the version 2.x codebase with the launch of HDP 2.0 in late October. It was followed by Cloudera with its CDH5 offering, and Intel with its Intel Distribution for Apache Hadoop version 3 announcement last week. Other Hadoop vendors, including MapR Technologies, Pivotal, and IBM, are expected to ship Hadoop distributions based on version 2.x next year.
Venture Capital Spigot Stays Open
The year 2013 was one for big money aimed at the array of database software offerings in the market, be they Hadoop-based or otherwise. Funds came flowing in from every direction as VC firms jockeyed for position to claim their share of the swelling big data market – a market that IDC predicts will grow to $32.4 billion by 2017.
The money came in so fast and furiously this year that PricewaterhouseCoopers and the National Venture Capital Association reported that investments in the software industry amounted to $3.6 billion during the third quarter of 2013 – a quarterly number that hadn't been reached since 2001, prior to the collapse of the dot-com bubble.
One of the biggest funding splashes made this year was by MongoDB (formerly known as 10gen), which raked in an astonishing $150 million for its popular NoSQL database known for its relative ease of use. They were far from being alone in the field of NoSQL money takers.
There were others in the NoSQL space who took in some impressive hauls. DataStax took in $45 million for its implementation of Cassandra. Couchbase wheelbarrowed $25 million into its vault for its campaign to win the key-value and document database arena. MarkLogic took in $25 million for its NoSQL push toward a semantic future. And late entrant FoundationDB took in $17 million for its vision of an ACID-compliant NoSQL database.
If that wasn’t impressive enough, major purveyors of the Apache Hadoop platform took in considerable dollars in the last year, heating up the Hadoop wars considerably. Hortonworks made a big splash on the eve of Hadoop World this year with a $50 million shot in the trunk that shook the ground. Rival vendor MapR Technologies closed a $30 million round this past spring, giving it fresh legs in the Hadoop race. And while it was announced in December of 2012, the impact of Cloudera’s $65 million boost was certainly felt in 2013.
While this lower part of the stack received a lot of love from the VCs in 2013, the year 2014 could be a different story as the focus starts to shift up the stack toward the companies developing applications that leverage these technologies. It should be interesting to see how the money shakes out among the higher-level applications over the next year.
Startups Galore in the Big Data Space
Driven by a surge in venture capital funding and an improving economy, 2013 was another big year for startups in information technology. Startups were active in many industries, but the activity was especially vigorous in the big data space, reflecting both the immense promise of the field and the hype that pervades it.
Among the startups making headlines this year was Metric Insights, a San Francisco software company that develops easier-to-use dashboards. Metric Insights (which technically was founded in 2010) took top honors at the Startup Showcase event during the recent Strata + Hadoop World conference.
Other big data software firms highlighted at the Startup Showcase included Sqrrl, a 2012 startup that built a NoSQL-based security product based on technology that emerged from the NSA (yes, that NSA); cloud-based data warehouse provider Appuri, founded in 2012; and Affinio, which develops a NoSQL-based graph engine. Alpine Data Labs (founded 2010) also made news with its iPad-resident data analytics client.
Data cleansing startup Trifacta came out of stealth mode about a year ago, and expects to launch its first product in 2014. Paxata–another company chasing the data cleanliness problems that are endemic in big data initiatives–also came out of stealth mode this year. Zoomdata, meanwhile, got $4.1 million this year to pursue development of its data visualization platform for historical and real-time data.
Seeq was founded this year with $6 million in seed money to fund development of its big data analytics for industrial processes. Speaking of industry, General Electric plunked down $105 million for a piece of Pivotal, the Hadoop distributor that EMC spun out in early 2013. Cloud analytics vendor Numerify came out of stealth mode earlier this year after getting $8 million, which it will use to launch its service in early 2014. That’s the same timeframe that Cooladata, a cloud analytics startup based in Israel, expects for the launch of its service.
Edward Snowden Reveals Uncle Sam’s Big Data Secrets
Former NSA contractor Edward Snowden dropped a bombshell on the world earlier this year when he disclosed that the U.S. government is collecting, storing, and analyzing massive amounts of data about nearly everybody on earth.
Snowden started his big data dump with revelations about PRISM, a previously classified data mining program that gathers data about communications (including phone, email, and text traffic) from Internet companies that house the communication systems, including Google, Microsoft, Apple, Facebook, and AOL.
The former CIA employee, who is currently in hiding in Russia, followed that disclosure up with revelations about more government programs, such as XKeyscore, Tempora, the interception of US and European telephone metadata, and eventually, the bugging of German chancellor Angela Merkel’s personal cell phone.
Snowden’s revelations raised numerous questions about how governments ought to proceed in the era of big data. The US Government says the programs are necessary to maintain security and to prevent terrorist attacks. However, opponents of the programs say snooping on regular Americans is a clear violation of the Fourth Amendment. In December, the snooping programs were dealt a blow when a federal court judge ruled that some of the NSA’s intelligence programs are unconstitutional.
Pivotal and Intel Join Hadoop Arms Race
The year 2013 saw the entrance of two weighty players onto the Hadoop field with Intel and EMC’s Pivotal.
Intel’s announcement came in February as the company signaled that it was no longer content to sit on the sidelines. The company started distributing its own flavor of Apache Hadoop, bringing its deep knowledge of hardware and software optimization to bear on the platform. Arguing that most distros in the space aren’t properly optimized to take advantage of the hardware, Intel started by tuning its distribution for its Xeon processors, claiming at launch a 20x improvement over standard deployments.
In April, EMC’s Pivotal spun out on its own, launching with a $105 million investment from GE and elevating its Pivotal HD distribution of Hadoop. GE’s plan is to leverage Pivotal to power its vision for an “industrial Internet” where machines are intelligent and connected.
Pivotal says that the new, emerging technology environment fundamentally turns every company on its head, making every one a software company. “In order to compete in the 21st century, everybody needs to be a software company,” Pivotal’s vice president of data platform product management Josh Klahr told an audience at the recent Strata + Hadoop World conference. As a heavily funded AWS challenger, Pivotal aims to capitalize on its vision of this new world.
While 2013 saw the entrance of new Hadoop distros, 2014 could bring the opposite result. We aren’t expecting to see any new commercial entrants in the Hadoop space at this late stage in the game. Rather, it’s likely the marketplace will be less crowded by this time next year.
SQL – Everything Old is New Again
Relational database technologies have taken a pounding over the last several years as the new, leggy NoSQL databases have shown off their flexibility and scalability. But even as NoSQL technologies begin to emerge into the mainstream, the irony is that SQL is proving to be a key bridge to the so-called “NoSQL” database future.
In 2013, the term “NoSQL” became somewhat ironic, and in some ways comical, as some in the fledgling space tried to walk back the “No” in “NoSQL” to mean such things as “not only SQL” or “not yet SQL.” Branding issues aside, the purveyors of these NonREL (anyone?) database technologies rushed in 2013 to get their own version of the familiar query language built into their respective platforms – and those who haven’t done it yet just seem late to market at this point. The reason for this is very simple: virtually every developer knows SQL.
Hadoop has been no exception. One of the most notable efforts in the back-to-the-future push towards SQL on Hadoop was Cloudera’s launch of Impala – an MPP query engine designed to exist beside MapReduce, giving Cloudera users real-time SQL querying capabilities. The Impala announcement followed EMC’s February release of its Pivotal HD Hadoop distribution, which launched with Greenplum’s SQL enabler, HAWQ.
Not to be left in the SQL shadows, Hortonworks announced the launch of the Stinger initiative in February, aiming to make SQL-enabler Apache Hive 100x faster than it was before. The project is still in progress, though it has made considerable headway. Meanwhile, competitor MapR has covered SQL with partnerships, such as with Hadapt and the recently announced partnership with Splice Machine. Additionally, MapR has invested engineering resources in the Apache Drill project, currently in alpha.
Meanwhile, Facebook, traditionally one of the biggest leaders in the movement to shun relational technologies, has done an about-face and started to publicly embrace SQL and relational technologies. “I am not even sure you want to get away from SQL,” Facebook analytics chief Ken Rudin told Enterprise Tech’s Timothy Prickett Morgan last month. “For the types of questions it is good at answering, it is the best way of answering those questions that I have seen so far. This notion of no SQL really took hold, and I hope that we as an industry are really over it. I have never seen anyone that is not using Hive with Hadoop, and that is SQL. Yes, it is converting it to MapReduce, but it is still SQL.”
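Rudin's point about Hive converting SQL to MapReduce is worth making concrete. A GROUP BY aggregation decomposes naturally into map and reduce phases, which is roughly how Hive plans such queries under the hood. Here is a toy Python sketch using hypothetical rows (this illustrates the translation, not Hive's actual planner):

```python
from itertools import groupby

# Hypothetical rows a Hive-style query might scan:
#   SELECT dept, COUNT(*) FROM employees GROUP BY dept
rows = [
    {"name": "ana", "dept": "eng"},
    {"name": "bo",  "dept": "sales"},
    {"name": "cy",  "dept": "eng"},
]

# Map phase: project out the GROUP BY key, emit a count of 1 per row.
mapped = [(row["dept"], 1) for row in rows]

# Shuffle + reduce phase: sort groups pairs by key; summing per group
# implements COUNT(*).
result = {
    key: sum(v for _, v in group)
    for key, group in groupby(sorted(mapped), key=lambda kv: kv[0])
}
print(result)  # {'eng': 2, 'sales': 1}
```

The familiar declarative query and the batch job compute the same answer; the developer just gets to write the former.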
Rise of Real Time
Hadoop’s popularity today can largely be attributed to the batch-oriented MapReduce paradigm. But many people in the big data industry recognize that processing historical data in multi-day increments only gets you so far. To really leverage and capitalize on big data collections, you need to be able to process and respond to data in real time.
TIBCO CEO Vivek Ranadivé built a business and a career on his ability to quickly route information to the person or application best able to take advantage of it, the so-called “two second advantage” that is also the name of his book. “If you have just a little bit of the right information a couple of seconds or minutes in advance, it’s more valuable than all of the information in the world six months after the fact,” Ranadivé told Datanami.
Figuring out exactly which pieces of data are going to help you achieve your big data objectives–whether it’s identifying potential new customers and upsell opportunities, preventing existing customers from defecting, or fingerprinting fraudulent transactions–is not easy. That is the promise and the challenge of big data in a nutshell.
In 2013, we saw a mix of new technologies and old approaches brought to bear on this challenge. Streaming data engines for Hadoop, such as Storm and S4, became more popular. Amazon released its Kinesis streaming data engine, and Yahoo gave us Samoa to serve as an overarching framework for Storm, S4, and whatever comes next. Cloudera introduced Impala to use tried-and-true SQL to get at data stored in HDFS, while there was a big move to embed traditional search engines, such as Lucene, into Hadoop.
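What separates engines like Storm and Kinesis from batch jobs is that they compute incrementally, per event, rather than re-scanning history. A minimal sliding-window counter captures the core idea; this is a generic Python sketch, not any particular engine's API:

```python
from collections import deque

class SlidingWindowCounter:
    """Count the events seen in the last `window` seconds of a stream."""
    def __init__(self, window):
        self.window = window
        self.events = deque()  # event timestamps, oldest first

    def record(self, ts):
        """Ingest one event and return the current window count."""
        self.events.append(ts)
        # Evict events that have aged out of the window.
        while self.events and self.events[0] <= ts - self.window:
            self.events.popleft()
        return len(self.events)

counter = SlidingWindowCounter(window=10)
for t in [1, 3, 5, 12, 14]:
    current = counter.record(t)
print(current)  # 3 -- only the events at 5, 12, and 14 are still in the window
```

Each event costs a constant amount of amortized work, so the answer is always current, which is the property the "two second advantage" depends on.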
Meanwhile, the in-memory bandwagon got a little bigger, as front-end visualization products like Tableau, Qlikview, TIBCO’s Spotfire, and MicroStrategy’s new Analytics Express solidified themselves at the top of the real-time stack. On the NoSQL and NewSQL database front, we saw an increased emphasis on speed, scalability, and the capability to process transactions. Databases like MongoDB, HBase, and Cassandra are being called upon to be the real-time servers that act upon and monetize the insights delivered by deep analytic services hosted in Hadoop.
Hadoop as Enterprise Data Hub
In October, when Cloudera revealed its new strategy for transforming Hadoop into an “enterprise data hub” at the Strata + Hadoop World conference, it surprised almost nobody. Instead, it met with general approval and collective head nodding that, yes, this is obviously where Hadoop is headed and where it needs to go.
The transformation of Hadoop into an enterprise data hub is tied up with several other technology and product-oriented trends we saw unfolding in 2013, namely the integration into the Hadoop framework of data engines that aren’t named “MapReduce,” and the release of Hadoop version 2 and YARN. This brings us interactive SQL (Cloudera Impala/Hive), NoSQL data stores (HBase), streaming data (S4, Storm, Samoa), search (Lucene), machine learning (Mahout), and graph engines (Giraph).
There is little doubt in the minds of big data people that Hadoop needs to be bigger than MapReduce to succeed. If Hadoop version 1 was all about figuring out and capitalizing on MapReduce, then Hadoop version 2 is about diversifying the pool of available processing engines and expanding the raw capabilities of Hadoop clusters. Doug Cutting, the father of Hadoop, even went out on a limb at Strata + Hadoop World and predicted that online transactional processing (OLTP) workloads will eventually run on Hadoop. Perhaps this is Hadoop version 3.
But the enterprise data hub strategy goes beyond processing engines, algorithms, and OLTP. Instead, what Cloudera articulated for the rest of the industry was a business plan that repositions and transforms Hadoop from a platform to run your data analytics workloads into a platform for all of your workloads. In Cloudera’s vision, all data flows through Hadoop, and therefore all software vendors need to tap into the framework if they want data.
The combination of big data technologies, cheap cloud processing, and ubiquitous mobile computing is rapidly moving the IT industry towards a crossroads that will determine the future of technologies and the business plans of IT vendors. Hadoop turned the problem of the expense of moving data on its head by proposing that application code should be moved instead. This is exactly the sort of relatively simple but transformative idea that can upend a $3 trillion industry.
With its enterprise data hub vision, Cloudera simply extrapolated this idea into an enterprise business plan. It’s not surprising that Cloudera’s competitors in the Hadoop space have embraced some version of the enterprise data hub strategy, because it just makes sense. However, some vendors, like Hortonworks, are wary about Cloudera’s enterprise ambitions and worry about what it will mean to them and their business models.
In the end, the enterprise data hub is a no-brainer for big data types. It puts Hadoop squarely in the middle of enterprise IT, where it can have its greatest impact. In his Strata + Hadoop World presentation, Cutting said he had expected there to be multiple systems like Hadoop at this point in time. Well, there aren’t. Hadoop is the best and only game in town when it comes to handling a petabyte of data without totally breaking the bank. It is on its way to becoming “the kernel of the de facto standard operating system for big data,” as Cutting said, and there doesn’t appear to be anything to get in its way.
The Rise of Graph Analytics
With big data still very much in the sandbox experimental mode, graph analytics has become one of the go-to applications for business analysts to test drive and try to derive some form of actionable insights from the relationships in their data. This has led to some interesting developments in this fast rising space in 2013.
One of the more interesting developments came in August, when Facebook announced that it had made a major code contribution to the Apache Giraph project. Using this supercharged version of Giraph, Facebook claimed that it had scaled the system past a trillion edges.
Giraph, an open source project based on Google’s Pregel system, looks to be a promising analytics system, especially given Facebook’s stamp of approval. With the emergence of Hadoop 2 and YARN, it’s not hard to imagine seeing a commercial rush to capitalize on the work that has gone into the framework.
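Giraph's vertex-centric model, inherited from Pregel, has each vertex exchange messages with its neighbors in synchronized rounds called supersteps. A toy connected-components computation shows the shape of it; this is plain Python illustrating the model, not the Giraph API:

```python
# Pregel-style "think like a vertex" sketch: each vertex repeatedly
# adopts the smallest label it hears from its neighbors; when no label
# changes during a superstep, the connected components are settled.
graph = {1: [2], 2: [1, 3], 3: [2], 4: [5], 5: [4]}
labels = {v: v for v in graph}  # each vertex starts labeled with itself

changed = True
while changed:  # one loop iteration = one superstep
    changed = False
    # Message passing: every vertex sends its current label to its neighbors.
    inbox = {v: [labels[u] for u in graph[v]] for v in graph}
    for v, messages in inbox.items():
        best = min(messages + [labels[v]])
        if best < labels[v]:
            labels[v] = best
            changed = True

print(labels)  # {1: 1, 2: 1, 3: 1, 4: 4, 5: 4} -- two components
```

At Facebook scale the same per-vertex logic runs in parallel across a cluster, with the framework handling partitioning and message delivery between supersteps.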
If Giraph vendors rise, they’ll face some stiff competition. The year 2013 saw the funding of graph analytics vendors, including GraphLab, which took in $6.75 million in Series A funding in May, and Ayasdi, which hauled in an impressive $40.6 million in 2013. Meanwhile, Cray’s YarcData has been pushing into biosciences with Urika, its graph based big data appliance.
Adding more fuel to the graph fire, Intel announced this month that it’s adding graph capabilities to its Hadoop stew. Shipping them as part of its Intel Distribution for Apache Hadoop 3.0, Intel is aiming its graph-infused Hadoop distro at retail customers who want to uncover cross-sell and up-sell opportunities.
As the big data technology trend matures another year, the graph analytics part of the trend will be very interesting to watch in 2014.
Cloud Services Proliferate in 2013
Every rocket ship needs a launching pad, and for the big data technology trend, cloud computing appears to be just that. There was so much cloud news this year that it’s difficult to know where to even start when trying to examine what happened in 2013. Cloud offerings have become table stakes for every enterprise software vendor in the universe, and the big data arena is no exception.
One of the more notable cloud-based developments of 2013 came when IBM announced that it would be making Watson available as a development platform delivered as a service. IBM sees big things in the future of this cloud-based natural language analytics service, with applications ranging from classrooms to health clinics, libraries, cities, retail outlets, and more. The year 2014 promises to be one where we start to see more diverse and interesting applications of Watson take root.
You can’t really talk about the cloud without focusing on Amazon, which has become the virtual center of the big data cloud universe. At its AWS re:Invent 2013 conference this year, Amazon unveiled a host of notable third-party additions to its AWS service, including Splunk, MarkLogic, and Syncsort, alongside its own NoSQL offering, DynamoDB. And that’s not even scratching the surface of what happened in 2013 as Hadoop and NoSQL vendors rushed to make their offerings available to the AWS user base.
NoSQL database vendors weren’t the only ones getting in on the cloud action. Amazon also announced its new RDS for PostgreSQL offering to complement its other relational offerings, including MySQL, SQL Server, and Oracle database.
As noted elsewhere in this feature, one of the chief leaders in the webscale big data movement, Facebook, has turned around on relational technologies, embracing them as the right tool for the right job. How does a company like Teradata, which had a rocky 2013, capitalize on news like this? One way is to offer its own cloud service, which the company announced it would be doing this past fall. The new service offering will start with a data warehouse as a service, and expand to discovery and data management offerings. The company says Netflix is already on board, and there are whispers of some significant others.
The cloud was a key player in the big data arena in 2013, and will continue to be as it eventually recedes into the background, turning from “cloud computing” to just “computing.”