2020: A Big Data Year in Review
There are two weeks left in 2020, which means it’s time to exhale a bit and see where we’ve gone. It’s been a bumpy ride over the previous 50 weeks, for sure. But big strides have also been made for those pursuing big data, advanced analytics, and AI, and those accomplishments deserve some credit.
The big story of 2020, of course, was COVID-19, which as of this writing has infected 75 million people and been implicated in nearly 1.7 million deaths around the world, according to the Johns Hopkins Coronavirus Resource Center, which emerged as the most trusted dashboard tracking the pathogen’s global spread.
In January, when the novel coronavirus raised its nasty little head, it was seen primarily as an issue for China. But by early February, when we first started writing about COVID-19, the analytics community was already gearing up for a response. Little did we know that it would dominate news coverage globally and in the big data community for the next 10 months (and likely longer).
At Datanami, we wrote more than 100 articles about various data-related aspects of COVID-19. They ran the gamut, from how AI was helping to find therapies and the surge of open data sets to the potential for graph databases and the impact on data jobs. Unfortunately, the analytics response to COVID-19 unearthed some longstanding challenges that teams seem to always run into when working with real-world data, particularly around the quality and trustworthiness of COVID-19 data, as well as questions about privacy tradeoffs.
The other major story of the year was the rise of the public cloud. The cloud was already growing fast at the beginning of 2020, and then COVID-19 happened and cloud growth kicked into overdrive. AWS grew at a 33% rate in the first quarter ended March 31, Google Cloud at 34%, and Microsoft Azure at a whopping 59%. Amazon.com, meanwhile, went on a hiring binge, bolstering its employee rolls by 380,000 over the past year. Collectively, Amazon, Microsoft, and Google grew their market capitalizations from $3.07 trillion to $4.43 trillion since December 30, 2019. (Apple, which does not run its own cloud, grew its market cap by nearly a trillion dollars.)
While the stock market rewarded Big Tech investors, patience with the firms’ impact on privacy appeared to be inversely proportional to their stock prices. In July, the CEOs of Facebook, Amazon, Google, and Apple testified remotely in a House Judiciary Committee hearing over their business practices. Anger over Big Tech’s data privacy abuses was close to boiling over this fall following the release of Netflix’s docu-drama “The Social Dilemma.” Less than a month before the presidential election, after Facebook and Twitter blocked circulation of an explosive story about the lost laptop of then-presidential candidate Joe Biden’s son, Hunter, several US Senators called for the removal of social media companies’ Section 230 protections under the 1996 Communications Decency Act.
The growth of the cloud presented a winning opportunity for cloud data warehouse vendor Snowflake, which had a very successful IPO in September. SNOW raised $3.4 billion on the NYSE at a valuation of $33 billion, earning it the title of largest ever IPO for a software company (even if Snowflake technically is in services). Other tech IPOs to take place this year included Sumo Logic, C3.ai, and Palantir (technically a direct listing), and Databricks is eyeing its own public debut in early 2021.
Speaking of clouds, Cloudera seemed to find its footing in 2020 after a tumultuous couple of years that saw the market for Hadoop disintegrate as cloud-based big data alternatives gobbled up share. The company launched its remade platform in late 2019, and by December 2020, it was close to delivering the promised data management capabilities for customers in the cloud, on-prem, and (most importantly for Cloudera’s vision) in hybrid environments.
As organizations continued to pile data into data lakes, native cloud data warehouses like Amazon Redshift, Azure Synapse Analytics, and Google BigQuery gobbled up much of the SQL analytics workload. Getting data into these warehouses has been good business for a new group of ETL (and ELT) software vendors, such as Fivetran and Fishtown Analytics.
Amid the ETL-ELT data pipeline brouhaha, Presto has emerged as a possible solution for this mess. As a distributed SQL query engine (Facebook’s follow-on to Apache Hive), Presto doesn’t store data and instead pushes queries to external data stores, including data lakes. Keen to capitalize on Presto’s rising star, Ahana emerged this summer to challenge Starburst’s hegemony over all things Presto. Unfortunately, this has split the open source Presto community, which is something we’ll be closely watching in 2021.
Another item to watch next year will be what happens to the California Consumer Privacy Act (CCPA), which went into effect on January 1 (although enforcement didn’t start until June 15). The law was thrown for a loop in November, when California voters passed Proposition 24, which instituted the stricter California Privacy Rights Act (CPRA). CPRA won’t go into effect until 2023, and when it does, it will be enforced by a new government entity called the California Privacy Protection Agency (CPPA). But in the meantime, the CPPA will be built and will enforce the CCPA. While the mix of privacy laws in California is confusing, it’s just a drop in the bucket for global companies, which now have dozens of laws to adhere to.
Concerns over the potential for abuses of AI technology gained steam, particularly following the revelation last January that a company called Clearview harvested hundreds of millions of images of people’s faces from Facebook, Google, Twitter, and LinkedIn to train its facial recognition algorithm, which is sold to law enforcement. There are now numerous municipalities around the world that ban the use of that technology, and most of the big tech firms say they will no longer sell facial recognition services.
Following the death of George Floyd and a summer of political unrest amid campaign rallies and Black Lives Matter protests, questions about how AI can amplify bias and perpetuate racism became a pressing topic. The potential for algorithmic bias is so great that half of executives say they’re slowing down their AI initiatives to ensure their systems are working fairly, according to a Deloitte study. The matter reached a head in December, when Google and AI researcher Timnit Gebru parted ways following a dispute over an unpublished paper about the role that large NLP models play in perpetuating bias.
The language models got really, really big this year. In February, Microsoft released the details of T-NLG, which, with 17 billion parameters, was the largest language model ever. But that looked like child’s play in July, when OpenAI released GPT-3, with a staggering 175 billion parameters. The model was so good that people had trouble distinguishing sentences written by GPT-3 from those written by actual humans.
Despite the growing sophistication of machine learning models, political pollsters had another off year in 2020. Following the dismal showing in 2016, in which nearly every major political poll failed to detect widespread support for Donald Trump in the presidential election, many of the same data errors were at play again in 2020. While Trump failed to win at the top of the ticket, there were down-ticket Election Day surprises for the Democratic Party, which had expected big gains.
Trends in tech continued to reflect the movement of big data. Managing data in a hybrid and multi-cloud world was a top concern, right alongside the need for better data governance and security in that same world. Companies started to get smart about wasting billions of dollars’ worth of cloud computational resources, which led to an interesting conclusion: the cloud was never about saving money in the first place; it was always about flexibility! And after years of hype, Kubernetes had a bit of an off year, as people began wondering whether it is too complex, particularly in growing edge use cases.
R, which some folks left for dead amid the dominating grip of Python, made an amazing comeback this year. The language, long favored by academics thanks to the large number of statistical packages available for it, rose 12 spots on the TIOBE Index to land at number 8 this summer. The source of the comeback? Possibly a surge in interest in data science due to COVID-19.
Despite tough times on Main Street, venture capitalists on Sand Hill Road continued to place big bets on tech firms. DataRobot brought in $270 million as it eyes an IPO. Confluent brought in $250 million, as did Cohesity in a Series D round. Qumulo had a $125 million round at a $1.2 billion valuation. Collibra completed a $112.5 million round at a $2.3 billion valuation. Couchbase nabbed $105 million. Dataiku had a $100 million Series D.
Cockroach Labs completed an $87 million Series D. BigID had a $70 million round at a $1 billion valuation. Dremio also brought in $70 million in a Series C. Grafana landed a $50 million Series B. ChaosSearch brought in $40 million, the same amount that Rockset announced. SQream nabbed $39.4 million in a Series B+ round. Habr brought in $38.5 million. $32 million materialized for Materialize. Fishtown Analytics brought in $29.5 million. Pachyderm completed a $16 million Series B round. Dathena had a $12 million round. Brytlyt brought in $4 million, and Ahana raised $2.25 million in a seed round.
There were also some name changes. Syncsort became Precisely, MemSQL became SingleStore, and SoftNAS became Buurst. The acquisition front was fairly quiet. Qubole was bought by Idera in October. Hitachi bought Waterline Data, as well as a company called Containership. TIBCO bought Information Builders. Qlik bought Blendr.io as well as a company called RoxAI. Intel acquired SigOpt. Splunk bought both Plumbr and Rigor.
The COVID-19-driven shift away from the physical world toward online learning and remote work was hugely disruptive. Zoom buckled, but did not break, under the load, and the company’s stock grew 10x in response. In April, Microsoft CEO Satya Nadella declared: “We’ve seen two years’ worth of digital transformation in two months.”
Another staple of tech work was rudely interrupted by the pandemic: the humble tech conference. Thousands of in-person events were cancelled over the past nine months, and thousands more likely will not take place over the ensuing months. A side effect of this massive shift is the emergence of virtual conferences. These were hit-and-miss affairs, as tech companies struggled to find the right mix of live interaction versus recorded videos.
COVID-19 also gave rise to a phenomenon the likes of which has never been seen: the multi-week tech conference. Google Cloud boldly ventured into unexplored territory this summer when it launched a nine-week conference, Next ’20: On Air. By comparison, AWS’s re:Invent, which just concluded, lasted only three weeks. Will those records be broken in 2021? Only time will tell.