Five Things to Consider as Strata Kicks Off
Today marks the start of the fall Strata Data Conference in New York City, which has traditionally been the big data community's biggest show of the year. It's been a wild ride for the big data crowd in 2018, one that's brought its share of highs and lows. Now it's worth taking some time to consider how far big data has come, and where it may be headed.
Here are five things to keep in mind as the Strata Data Conference kicks off.
1. It’s Not Just Hadoop Anymore
We've said this before, but it bears repeating: Hadoop is just one of many technologies angling for relevance in today's increasingly heterogeneous at-scale computing environment. Indeed, it's been over a year since Cloudera and O'Reilly Media renamed the show that for many years was called "Strata + Hadoop World." Hortonworks made a similar move with its show.
While Hadoop remains a core option for storing massive amounts of data, it's increasingly been marginalized as a computing resource. Some of that is due to the continued technological progress of other database platforms, including SQL and NoSQL systems, but some is also due to how much can now be done with stream computing systems and cloud-based object stores.
And while there is still some validity to the idea that companies should consolidate all their data in one place and bring the applications to the data, it's becoming abundantly clear that this is not feasible in many, if not most, situations. As data volumes continue to grow at near-exponential rates, companies continue building data silos designed to serve a single purpose, and there's no indication that practice will end any time soon.
That doesn't mean Hadoop has no role. Indeed, its usage still appears to be growing, and many organizations are getting lots of value out of their Hadoop data lakes. But it's clear that the dream of Hadoop playing a central (and centralizing) role in big data is over.
2. Cloud Looms Large
As Hadoop dims a bit, there’s another star rising on the horizon: the cloud. Public cloud platforms like AWS, Azure, and Google Cloud have been the big beneficiaries of the ongoing shift in the big data architectural strategies of enterprises large and small.
Where data once was penciled in to flow into HDFS, it's now roaring into the cloud providers' respective object stores: S3 for Amazon, ADLS for Microsoft, and Cloud Storage for Google. Survey after survey shows that companies are moving data into the cloud at a breakneck pace, and nothing indicates that data flow will slow any time soon.
And what’s not to like about the cloud? Companies can scale their compute tasks up and down as needed, while taking advantage of an a la carte menu of big data services: SQL analytics, machine learning, streaming analytics, NoSQL databases, graph processing, and more.
Of course, there's a price to be paid, and we're not talking about the data movement costs (although those can be hefty, too). While open source drives much of the progress in big data, the services offered to customers are typically proprietary, and migrating those workloads off the cloud is often difficult, if not impossible. But for some cloud customers, vendor lock-in has never felt so good.
3. All About the AI, Baby
The rapid emergence of deep neural nets is propelling a massive resurgence in artificial intelligence, as fields like computer vision, speech recognition, and natural language processing experience an explosion of progress.
While machine learning has always been apropos of the big data discussion, today it's dominating the conversation in the form of deep learning. Buoyed by frameworks like TensorFlow and PyTorch that simplify deep learning, companies are rushing to build AI smarts into a variety of business processes, from ERP systems and enterprise search to human resources and social media analysis.
Despite the high level of hype, however, this new form of AI remains a tough nut to crack. Training a deep neural network requires a huge amount of processing power, typically provided by GPU clusters, which are not cheap. Deep learning also requires lots (and lots!) of clean, labeled data, which remains in short supply for all but the biggest enterprises. It's no surprise that cloud providers are leading in the AI department, since they have the world's biggest data centers and much of our data.
Even so, AI continues in a raging "summer" phase for now, and there's no indication the heat is fading. With new AI chips on the horizon, not to mention billions flowing from venture capitalists into AI startups, there's a definite feeling that we're on the cusp of seeing AI propel some gargantuan hits in the near future. Short sellers, beware.
4. Governance’s Dirty Laundry
If there's anything that could trip up the big data and AI juggernaut at this point, it's government regulation and a backlash against big data abuses and security breaches. Large companies around the world absorbed the first big regulatory blow earlier this year when the General Data Protection Regulation (GDPR) went into effect. We've yet to see the European Union take its first enforcement action under the new law.
Forward-looking organizations took the passage of GDPR, as well as big data abuses such as the Cambridge Analytica scandal that plagued Facebook, as a sign that they needed to clean up their big data acts, or governments would do it for them. This has spurred investments in governance and security software, and led companies to ensure their ducks are in a row before expanding internal access to big data sets.
In the United States, where a general lack of regulation has made big data nearly a free-for-all up to this point, a heightened focus on governance has slowed rollouts of big data analytics programs to some extent. Companies hesitant to invite regulatory scrutiny are now taking time to ensure their data is properly governed and secured.
Increasingly, we’re seeing companies appoint chief data officers (CDOs) to oversee the handling of data, which remains one of the hardest parts of the big data exercise. After all, if you can’t manage the data effectively, your chances of succeeding with machine learning and AI are slim to none.
5. Skills In Demand
There is never a perfect match between the requirements of enterprise and the skills available in the employee pool. However, in big data, we’re continuing to see rapid evolution of technology, and with it the skills needed to make it go.
Data scientists continue to be in high demand, which is to be expected when the statistical complexity of the work exceeds what a layperson can handle. But we're also now seeing demand spike for various types of engineers: data engineers, data science engineers, and even machine learning engineers. As companies deploy and maintain data-hungry AI models, the skills needed to keep those models running are not necessarily in the data scientist's wheelhouse.
This skills gap makes educational venues like the Strata Data Conference necessary for folks who intend to stay on the cutting edge. Organizations that are building and deploying big data solutions need to know what the technology can deliver now, and what direction it will take next. Shows like Strata remain invaluable for giving us hints about what's to come.