2016: A Big Data Year in Review
With another year almost in the books, and with 2017 looming just over the horizon, now is a good time to take stock of what happened in the big data analytics space over the previous twelve months, to assess where we’ve come from and what direction we may go.
It’s been an eventful year for big data, to say the least. Nobody knows what 2017 will bring, although that won’t stop us from sharing predictions from some of the brightest minds in big data (stay tuned). Here are some of the top news-making items, events, and trends that helped to shape 2016 and turn it into the big data year that was.
Fall of BI Leaders
When the high-flying BI and visualization tool vendor Tableau (NYSE: DATA) lost half its market capitalization in a single day of trading following disappointing financial results last February, it became abundantly clear that the BI market was in for a rocky ride in 2016. The carnage continued several months later when Qlik Technologies (NASDAQ: QLIK), whose stock had lost more than half its value, was acquired for about $3 billion by Thoma Bravo in June.
While Tableau and Qlik were (and still are) class-leading tools, their once-commanding leads have shrunk as a bevy of less expensive but increasingly capable BI alternatives from the likes of Microsoft (NASDAQ: MSFT), MicroStrategy (NASDAQ: MSTR), Alteryx, Birst, Domo, Sisense, GoodData, and others have emerged. Gartner, which counted no fewer than 24 vendors in its 2016 Magic Quadrant for BI and Analytics platforms (which didn’t even include impressive BI newcomer Zoomdata), says the market has reached “a tipping point that requires a new perspective.”
Rise of AI
When Google DeepMind’s AlphaGo beat a human champion at the ancient game of Go, it became clear that we were witnessing artificial intelligence (AI) technology’s “Big Bang” moment, to borrow the words of an NVIDIA (NASDAQ: NVDA) product manager. From AI-powered assistants like Siri and Alexa to self-driving cars, millions of consumers are on the cusp of realizing significant benefits from AI.
We saw numerous new AI services launched, too, including the new Amazon AI offering unveiled last month at the Web giant’s annual AWS re:Invent confab. And when UC Berkeley announced in October that the prolific AMPLab, which gave us big data technologies like Apache Spark, would be succeeded by RISELab, which will focus in part on AI and applications like self-driving cars, it lent more evidence to the notion that AI is engulfing and overtaking big data as a concept.
Hadoop Turns 10
Hadoop hit double digits on a day in late January 2016 that marked the 10-year anniversary of the first production Hadoop cluster at Yahoo. The Yahoo engineers, who were happy if that first 10-node cluster ran continuously for a full day, had no idea that Hadoop would become the poster child for big data computing and a staple in the IT shops of every Fortune 100 company.
Hadoop’s success surely exceeded the expectations of Doug Cutting, the Cloudera architect who co-created Hadoop with Mike Cafarella. In a wide-ranging talk at Strata + Hadoop World, Cutting wondered aloud whether we’ve reached “peak Hadoop,” and what the next 10 years of Hadoop might look like. Considering the collective yawn that emanated from the big data community over the development of Hadoop version 3 (which will effectively double storage capacity by introducing erasure coding), and the relentless pace of big data tech’s evolution, it’s tough to say what Hadoop will look like in 2026, or if we’ll still be talking about it.
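That capacity claim is easy to sanity-check with back-of-the-envelope arithmetic. Here’s a minimal sketch in Python, assuming the commonly cited Reed-Solomon (6, 3) erasure coding policy (the exact policies and defaults are up to the cluster operator):

```python
# Back-of-the-envelope comparison of HDFS storage overhead:
# classic 3x replication vs. a Reed-Solomon (6, 3) erasure coding
# scheme of the kind planned for Hadoop 3. Illustrative only.

def replication_overhead(replicas=3):
    """Raw bytes stored per byte of user data under n-way replication."""
    return float(replicas)

def erasure_coding_overhead(data_blocks=6, parity_blocks=3):
    """Raw bytes stored per byte of user data under RS(data, parity)."""
    return (data_blocks + parity_blocks) / data_blocks

rep = replication_overhead()    # 3.0x raw storage
ec = erasure_coding_overhead()  # 1.5x raw storage
print(f"Replication: {rep:.1f}x raw storage per logical byte")
print(f"RS(6,3) erasure coding: {ec:.1f}x raw storage per logical byte")
print(f"Effective capacity gain: {rep / ec:.0f}x")
```

Halving the raw bytes needed per logical byte is where the “double the storage capacity” headline comes from.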
Apache Spark Rules
Hadoop’s open and economical approach to distributed computing surely captured the attention of tech pros who struggled to process large data sets using expensive, proprietary software. But if Hadoop’s Java-based star has started to dim of late, it’s being replaced by another with the potential to shine even brighter: Apache Spark.
Apache Spark’s meteoric climb up the big data ladder has been fun to watch, particularly as big vendors like IBM (NYSE: IBM) embrace it and just about every BI and visualization tool vendor uses the in-memory technology under the covers to process batch, interactive, and streaming workloads. Some postulate that Spark will eventually surpass Hadoop in use and popularity, if it hasn’t already.
Enter Flink and Beam
Even as Spark has essentially replaced MapReduce as the batch processing engine in Hadoop (to say nothing of Spark’s SQL, machine learning, and stream processing bona fides), the restless big data community is angling to improve on the versatile Scala-based framework backed by Databricks. Apache Flink and Apache Beam have emerged as the key competitors to Spark in the battle of the big data frameworks.
Cloudera’s Cutting tipped his hat to data Artisans’ Flink project in March, when he said “Flink is architected probably a little better than Spark.” Meanwhile, Apache Beam, which is based on Google’s Cloud Dataflow API and is being championed by a French big data architect at Talend (NASDAQ: TLND), has emerged with the ambitious goal of unifying all big data app development under a single API, with “runners” out to Spark, Flink, and Google Cloud Dataflow.
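Beam’s core idea, one portable pipeline definition handed to interchangeable execution back ends, can be sketched in a few lines of plain Python. To be clear, the classes below are illustrative stand-ins, not the real Apache Beam API:

```python
# Toy illustration of Beam's "write once, run anywhere" pitch: a single
# pipeline definition handed to interchangeable runners. These classes
# are hypothetical stand-ins, NOT the real Apache Beam API.

class Pipeline:
    def __init__(self):
        self.transforms = []

    def apply(self, fn):
        self.transforms.append(fn)
        return self

    def run(self, runner):
        # The pipeline only says WHAT to compute;
        # the runner decides HOW (and where) to execute it.
        return runner.execute(self.transforms)

class LocalRunner:
    """Executes transforms in-process (think Beam's direct runner)."""
    def execute(self, transforms):
        data = [1, 2, 3]  # stand-in for a real source
        for fn in transforms:
            data = [fn(x) for x in data]
        return data

# In real Beam, swapping in a Spark or Flink runner would ship the
# same pipeline definition to a cluster unchanged.
pipeline = Pipeline().apply(lambda x: x * 10).apply(lambda x: x + 1)
print(pipeline.run(LocalRunner()))  # [11, 21, 31]
```

The appeal for developers is obvious: write the transform logic once, then pick the engine per deployment rather than per codebase.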
Epic Polling Failures
There’s no disputing the fact that political polls today have become an exercise in applied statistics, i.e. “big data analytics.” While scientific opinion surveys in the past could be conducted reliably by pulling names and numbers from the white pages, today’s pollsters must create carefully weighted models if they hope to tease a representative sample out of diverse electorates.
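A toy example of the kind of reweighting described above, with entirely invented numbers, shows why the raw sample average isn’t enough:

```python
# Minimal illustration of why pollsters weight their samples.
# All groups, shares, and responses here are invented for the example.

def weighted_estimate(sample, population_shares):
    """Reweight per-group support so each group counts according to
    its share of the electorate, not its share of the sample."""
    estimate = 0.0
    for group, share in population_shares.items():
        responses = sample[group]
        support = sum(responses) / len(responses)  # group's support rate
        estimate += share * support
    return estimate

# Suppose young voters are 40% of the electorate but only a third of
# the respondents who picked up the phone.
sample = {
    "young": [1, 0, 1, 1],              # 75.0% support in sample
    "old":   [0, 0, 1, 0, 1, 0, 0, 1],  # 37.5% support in sample
}
raw = sum(sum(v) for v in sample.values()) / sum(len(v) for v in sample.values())
weighted = weighted_estimate(sample, {"young": 0.4, "old": 0.6})
print(f"raw sample support: {raw:.1%}")       # 50.0%
print(f"weighted estimate:  {weighted:.1%}")  # 52.5%
```

The catch, as 2016 demonstrated, is that the weights themselves are modeling assumptions; get the electorate’s composition wrong and the “corrected” number is wrong too.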
When pollsters failed to accurately gauge voters’ sentiment over the “Brexit” referendum in June, it raised a few eyebrows here in the United States. But when Donald Trump defeated Hillary Clinton in the presidential race in November, confounding nearly every respected political poll except one, it became the big data failure of the year, and possibly the decade.
Big Data Breaches
There’s clearly value in data, no matter what the insurance companies or accountants say. So it should come as no surprise that the bad guys want to steal your data, and steal it they have. We’ve seen some very high-profile data breaches this year, from the Russian-sponsored hacks of the Democratic National Committee’s email servers to Yahoo’s recent disclosure of a breach impacting 1 billion customers, to say nothing of Yahoo’s September admission that hackers compromised 500 million customers’ accounts.
Rounding out the 2016 Cybersecurity Wall of Shame are the likes of the Department of Justice (lost data on 30,000 DHS and FBI employees); the Internal Revenue Service (700,000 taxpayers’ records compromised); Verizon (1.5 million customers’ records compromised); Oracle (330,000 compromised MICROS cash registers); Dropbox (admitted to 68 million compromised accounts); and last but not least, AdultFriendFinder.com (412 million users’ records compromised), according to a tabulation maintained by IdentityForce.
New Data Startups
Venture capital investment was down about 10% from 2015, but that didn’t stop prospective tech entrepreneurs from taking calculated risks and creating new companies in hopes of striking big data gold. Among the newcomers we tracked this year were:
- SnappyData, which is aimed at uniting Spark and Pivotal’s GemFire data grid;
- Panoply, which is creating ETL software for AWS Redshift users;
- Cosmify, which is using machine learning to mine customers’ knowledge;
- Bonsai, the AI company that won the Startup Showcase contest at the spring Strata + Hadoop World conference;
- Armorway, which is using deep learning for cybersecurity;
- Leyvx, which is merging Flash with Spark;
- Jask, which is using AI for cybersecurity analytics;
- Alluvium, which seeks to close the “machine to human” gap;
- Pachyderm, the container company that won the Startup Showcase at the fall Strata + Hadoop World conference;
- Skry, a blockchain intelligence vendor; and
- Wavefront, which uses big data to monitor IT.
Kafka’s Epic Year
Everybody’s favorite big data bus, Apache Kafka, had an epic year in 2016, thanks to the emerging requirement for analyzing fast-moving data. Kafka is barely five years old, but the LinkedIn-developed message queue is already the de facto standard for managing the flow of streaming data and real-time data pipelines.
Kafka, which is backed by Confluent, the company led by Kafka creators Jay Kreps and Neha Narkhede, became one of the most popular big data projects in 2016. The open source project’s adoption rate is soaring, thanks to surging interest in real-time analytics. As the batch paradigm continues to merge with new real-time forms of data processing, don’t be surprised to see Kreps’ unified Kappa Architecture overtake the overly complicated Lambda Architecture that is currently in fashion.
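The contrast between the two architectures can be sketched in a few lines of Python. The code below is a deliberately simplified stand-in, not a real streaming framework:

```python
# Toy contrast of the Lambda and Kappa architectures. In Lambda, the
# same business logic lives in two code paths (a batch layer plus a
# speed layer); in Kappa, one streaming code path serves both, and
# "reprocessing" is just replaying the log (e.g., a Kafka topic) from
# the beginning. Everything here is a simplified stand-in.

def business_logic(events):
    """The computation we care about: running total per key."""
    totals = {}
    for key, value in events:
        totals[key] = totals.get(key, 0) + value
    return totals

log = [("clicks", 1), ("buys", 1), ("clicks", 2)]  # append-only event log

# Lambda: batch view over history plus speed view over recent events,
# merged at query time. Two code paths to keep in sync forever.
batch_view = business_logic(log[:2])
speed_view = business_logic(log[2:])
lambda_view = {k: batch_view.get(k, 0) + speed_view.get(k, 0)
               for k in set(batch_view) | set(speed_view)}

# Kappa: one streaming job; to "recompute," replay the whole log
# through the exact same code.
kappa_view = business_logic(log)

print(lambda_view == kappa_view)  # same answer, one code path
```

Kreps’ argument, roughly, is that once a durable, replayable log sits at the center of the system, the batch layer becomes redundant, and with it the burden of maintaining the same logic twice.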
Open Data Projects
Not all big data products are developed by profit-seeking companies. Indeed, many of the most promising new technologies come to us by way of open source projects. Among the big data projects making news this year were:
- Apache Arrow: This project, spearheaded by a Drill architect at MapR Technologies, seeks to create a common data layer that will work with a variety of big data tools and engines, like Drill, Spark, Impala, Cassandra, and Parquet;
- Alluxio: This in-memory filesystem emerged from the AMPLab (original name: Tachyon) alongside Apache Spark and Apache Mesos; it’s now backed by a company of the same name;
- Apache Beam: A single API for real-time, interactive, and batch processing with “runners” out to Spark, Flink, and Google Cloud Dataflow is one of the goals of this promising framework;
- CrateDB: Delivered under an Apache 2.0 license, CrateDB is a scale-out SQL database (some might call it a NewSQL database) for real-time machine analytics;
- Apache Kylin: The open source OLAP-on-Hadoop solution spent all of 2016 as a Top-Level project at the Apache Software Foundation;
- Apache Geode: In November, the ASF promoted Geode, a distributed, in-memory database based on Pivotal’s GemFire, to TLP status.
Big Data for Social Good
Big data analytics is now everywhere. Its presence is felt in the products we buy, the Web services we use, and the way we communicate. But at this time of year, it’s important to remind ourselves of our underlying humanity, and pause to consider what we can do to end the suffering of our fellow men, women, and children.
To that end, it’s good to see that big data can have a positive impact for social progress, not just fattening the bottom line. This year, we told you how big data is being used by groups like Polaris to fight human trafficking and put the perpetrators behind bars. You read how a group of journalists behind the Panama Papers used big data technology, including cloud-based analytics and graph databases, to dissect and expose offshore tax shelters.
Examples of big data’s impact on public health were numerous, including how the CDC is using machine learning to get in front of the opioid-fueled HIV outbreak, how Spark and Hadoop are accelerating cancer research, and how topological data analysis led researchers to reconsider what a “nuisance variable” means for how best to treat traumatic spinal cord injuries.
We don’t know what 2017 will bring to the world of big data. But if it’s anything like the past 12 months, we’ll get our share of unexpected breakthroughs, spectacular failures, and steady growth in core technologies that are changing how we live.