Streaming Analytics Picks Up Where Hadoop Lakes Leave Off
If phase one of the big data boom was about storing as much data as your lake will hold, phase two will be about extracting information from that data as quickly as you can. For many organizations, that means using streaming analytics to both shrink the decision window and reduce the flow of data into the lake.
As we’ve discussed in recent articles, Hadoop-based data lakes aren’t going to disappear into the digital night. HDFS, for all its faults, is the best thing going for massive petabyte-scale storage, at least until we have on-premises, S3-compatible, cloud-like object file stores that take us into the exabyte realm.
But going forward, Hadoop won’t be the center of the analytics universe. In fact, it seems likely that many organizations will find that stream processing systems can accomplish the bulk of their analytic goals while positioning them architecturally to take advantage of future innovations that are sure to come floating down Open Source River.
Steve Wilkes, the founder and CTO of streaming analytics software vendor Striim, said at the recent Strata + Hadoop World show that the folks at O’Reilly perhaps should have renamed the show “Strata + Streaming World.”
“In our space, what we’re seeing, and you can see it evident everywhere now, is the notion of streaming data and being able to deal with data as it’s being created, as opposed to after the fact, is really starting to take hold,” Wilkes told Datanami during the show. “There’s definitely this case for careful consideration with what you do with technology and what’s the most appropriate technology for your means.”
Kostas Tzoumas, one of the creators of Apache Flink and the CEO of data Artisans, says getting out of a Hadoop-first mindset is freeing people to consider the wide assortment of stream processing technology that’s being created and which has no ties to Hadoop.
“For a while, for some weird reason, every Apache project was considered as part of the Hadoop stack, but if you look at the deployments, this is really not the case,” Tzoumas said at the recent Strata show. “This is not about Hadoop anymore. This is really about data and data processing. Everything that is running continuously and processing data is a great fit for stream processing. We should get out of the mindset of putting everything under Hadoop.”
It’s a message that the Hadoop distributors (we will probably be asked at some point to stop calling them that) are in fact getting, as is evident from their moves to add new streaming data capabilities based on open source projects like Apache Kafka and Apache Storm (and Apache NiFi, in the case of Hortonworks) to their core offerings. In most cases, the Hadoop distributors are hoping to extend the investments they made in securing, managing, and governing their customers’ Hadoop clusters into the emerging streaming paradigm.
In many cases, this will extend the Hadoop distributors into the market for managing the storage and analysis of data generated in Internet of Things (IoT) applications. Whether this approach will succeed has yet to be decided. In any case, vendors like Striim will also be chasing the opportunity with their own products that combine open source and proprietary software.
“Scale is the future,” Hortonworks CTO Scott Gnau told Datanami at the show. “This gets to the broader footprint. It will not just be about how many exabytes of data you have in your cloud footprint. It will be, Do you have an application out on the edge in the device that’s doing processing on its own?”
Hortonworks unabashedly reaffirms its commitment to not owning any intellectual property (IP). Everything on its product menu is open source. This approach works for the do-it-yourself data group that’s comfortable working on the bleeding edge because the payout is potentially orders of magnitude larger than the risk, but it’s not necessarily workable for the smaller shop that can’t afford to invest millions in data scientists and data engineers who can work with open source tools.
Organizations that don’t want to deal with the open source burden will look to software vendors that can provide an abstraction layer that insulates them from rapidly changing open source technology. This is the approach that Impetus Technologies is taking with StreamAnalytix, which can use Spark or Storm engines to power streaming analytics.
Anand Venugopal, the head of StreamAnalytix at Impetus, says customers may get started easily enough with open source tech, but often run into problems down the road when they try to take it into production. “They don’t know how to scale” the systems, he said at Tabor Communications’ LBD + EHPC 2017 conference, held last week at Ponte Vedra Resort in Florida. “They get stuck everywhere.”
Impetus unveiled a new version of StreamAnalytix at the recent Strata + Hadoop World conference. With version 3, the company now lets users develop and integrate Spark-based batch workflows right alongside their streaming data workflows. Venugopal says Spark Streaming has rapidly gained popularity. “However, most enterprise big data use cases today need both Spark Streaming and Spark batch,” he said.
As the IoT kicks into gear and the world starts generating more data than can be realistically stored, the need for real-time analytics will increase. Organizations are positioning themselves now to get ahead of this tidal wave of data, and stream data processing figures to play an important role in the architectures now being devised.