Wanted: Intelligent Middleware That Simplifies Big Data Analytics
We’ve seen tremendous technological innovation in the data analytics space over the past 10 years. Platforms like Hadoop have emerged and machine learning techniques are going mainstream. But according to respected leaders in the big data community, there’s still a sizable gap in the marketplace when it comes to intelligent middleware that can orchestrate and organize how data flows, not just in Hadoop but across the enterprise.
Lars George was one of the first technologists to begin working with Hadoop in Europe back in 2007. At the time, George helped his client build a scalable Web solution, and he naturally gravitated to Hadoop, which provided a basic framework that helped solve that problem, although it did require a lot of coding (George wrote the O’Reilly book on HBase, by the way).
As Cloudera‘s chief architect for EMEA, George doesn’t get his hands dirty in the code as much as he used to. But from his perch in the mountains of southern Germany, George has identified several patterns in Hadoop adoption, which he described a year ago in a Cloudera Vision Blog post titled “Waves of Adoption – Evolution of Hadoop Users.”
“A year later now, I see the future being slightly more interesting, I have to say, than what I described in an earlier blog post,” George tells Datanami. “Now where are we? We’re [seeing] customers who want to use it now. But I think we’re still not there yet. I still think there’s a gap between the technology and the application.”
For Hadoop to truly succeed in becoming the platform for big data, it must become easier for end users to get big data applications up and running. For that to happen, something must be done to make it easier for developers to build applications that are targeted at specific verticals and run right out of the box.
“There’s still a gap,” George says. “Yes, you can roll out Hadoop, but you need a specialist to make this work for you. You need a data scientist to somehow use your data and produce an outcome that saves money, or to build a recommendation engine. It’s not easy to do this. Hadoop is still very raw technology. Yes, it’s getting more and more complete. But we’re lacking infrastructure in between.”
Just as relational databases eventually dissolved into “the stack” and became just another trusted component upon which you ran your Web server or your ERP system, Hadoop must evolve to become an almost invisible layer that enables a new generation of pre-packaged big data apps.
“I liken this to a car,” George says. “How many people can build an engine? There aren’t too many in the world. How many can build a car around an engine? There’s a lot of people who can build a chassis and somehow make it work. But how many use a car? Billions.”
Big Data for the Masses
Lavastorm Analytics CEO Drew Rockwell has identified a similar need for software that can mask some of the complexity that’s still evident in existing big data systems.
“The core problem, or opportunity, as we see it, is there’s been a ton of fantastic innovation on the processing layer and a ton of innovation on the visualization layer,” Rockwell tells Datanami. “But until you can get a less technical user, a business user, at that middle layer to assemble or build analytic applications and use cases and then publish them in whatever environment they want,” then we’ll be unable to progress to the next level.
Lavastorm Analytics is no newbie to the big data game–it’s been helping large organizations correlate and analyze data from multiple sources for more than 10 years. As Rockwell sees it, there’s an emerging need for a middleware layer that hooks together all of the various sources. “We see an emerging space in the marketplace that’s sort of the data assembly and analytic layer. It sits between the processing layer and visualization layer,” he says. “It’s a place where a business user and an IT user can collaborate and build analytic applications together.”
Whereas Hadoop distributors like Cloudera see a future where a single data lake holds all of an organization’s data, Rockwell isn’t buying the single-lake story, and sees a multitude of lakes forming at most organizations. Being able to traverse those lakes without changing tools and skillsets will be key to success.
“It really stems from the death of the idea that all data can be put in one place, and that analytics and reporting are the same thing,” he says. “When you get over those two hurdles you see the need for this authoring and orchestration layer. That layer will need to execute in whatever environment it needs to be executed in–whether it’s Hadoop or a Teradata environment or a DB2 environment.”
Building flexibility into the orchestration layer will pay dividends when technologies change. Because they will inevitably change, Rockwell says. “If you converted all your old SQL scripts to Spark, does that mean you solved all your problems? All your analytic logic that was formerly expressed in 5,000 lines of SQL and is now built in Spark routines–you think that’s a sustainable solution? What happens when that language changes or the next big thing comes along? Everything has to be re-expressed.
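The re-expression trap Rockwell describes can be sketched in code. This is a minimal, hypothetical illustration (not Lavastorm’s actual product): analytic logic is captured once in an engine-neutral pipeline, and small per-engine renderers translate it, so a change of execution engine means swapping a renderer rather than rewriting 5,000 lines of logic.

```python
# Hypothetical sketch of an engine-agnostic analytic pipeline.
# All names here are invented for illustration.
from dataclasses import dataclass

@dataclass
class Step:
    op: str    # e.g. "filter"
    expr: str  # engine-neutral expression

class Pipeline:
    def __init__(self, table):
        self.table = table
        self.steps = []

    def filter(self, expr):
        self.steps.append(Step("filter", expr))
        return self

    def to_sql(self):
        # Render the same logic as plain SQL...
        where = " AND ".join(s.expr for s in self.steps if s.op == "filter")
        return f"SELECT * FROM {self.table}" + (f" WHERE {where}" if where else "")

    def to_spark(self):
        # ...or as illustrative PySpark-style calls.
        calls = "".join(f'.filter("{s.expr}")' for s in self.steps if s.op == "filter")
        return f'spark.table("{self.table}"){calls}'

p = Pipeline("claims").filter("amount > 10000")
print(p.to_sql())    # SELECT * FROM claims WHERE amount > 10000
print(p.to_spark())  # spark.table("claims").filter("amount > 10000")
```

The point of the design is that the pipeline object, not the generated code, is the durable asset; when “the next big thing comes along,” only a new renderer is written.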
“Our view is there’s a real argument to be made here for this analytic orchestration and assembly environment, and that layer really enables flexibility and enables all the pieces in the ecosystem to do what they do well,” he continues. “There’s not enough computer scientists in the world to deal with all the volumes of data, and the analytic appetite is huge. If all of that is going to be solved by people with computer science degrees, then you’re never going to get the value that you need. So you need to create alternative ways to take less technical people and allow them to work with data at field-level in a record to extract insights. That’s what Lavastorm is about.”
Blueprints for Hadoop
While they may disagree on data lake architectures, Cloudera’s George and Lavastorm’s Rockwell agree on this: relying on data science “unicorns” is an unsustainable business model.
“If we think we can hire or train more data scientists to accelerate the adoption of Hadoop, I think we are mistaken,” George says. “We really need to do this differently.”
Instead of requiring each Hadoop customer to have a data scientist on staff to customize each application, in George’s view, the data scientists should be working at the software vendors that develop the big data applications—i.e. crafting the algorithms that can pick out the signal from the noise—and Hadoop operators and business analysts can do the remaining 20 percent of work required to get it running at each customer site.
Cloudera and other Hadoop distributors, including MapR Technologies, have rolled out pre-built application templates for some of the most common use cases, such as fraud detection and recommendation systems. In Cloudera’s case, the company has pre-built “blueprints” that shorten the development cycle–but that only gets the customer so far, and there’s still three to 12 months of custom development and training required.
“The cycles get shorter, but they’re not really fast enough,” George says. “If you want to reach more end users and get more use cases into the end customer and expand the footprint of Hadoop as a central data hub, then you….should be able to go to a marketplace and download the fraud detection application, and it should get you 80 percent where you need to be. You still can configure this…but you don’t have to learn Hadoop in the first place.”
But that’s just part of the solution that George sees. Data governance is a major issue impacting Hadoop and the big data community at large. Nobody has solved it with an open source project or shrink-wrap product yet. He’s seen projects with part of the solution flare up and get hot, only to fizzle and fade away. Products from Cloudera partners like Cascading and Cask are great at abstracting the view of Hadoop data. But George has yet to see a single project or product that can connect all the dots.
“All these things help you to abstract the view of the data, but they’re not describing where it lands in Hadoop, who has access to it, when do you recompress the file or compact the files to fewer larger ones, when do you move them into an archive directory, when do you repack them with a higher compression ratio, because you’re archiving them but you still want to have them active. When do you delete them?” he says. “I’m talking middleware of sorts that really describes all the functions in between.”
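The lifecycle rules George rattles off (compact small files, repack for archive at a higher compression ratio, eventually delete) are exactly the kind of thing the missing middleware could express declaratively rather than in ad-hoc scripts. A minimal sketch, with invented policy names and thresholds:

```python
# Hypothetical declarative lifecycle policy of the kind George describes.
# Thresholds and action names are assumptions, not any product's API.
LIFECYCLE = [
    {"after_days": 7,   "action": "compact"},  # merge small files into fewer larger ones
    {"after_days": 90,  "action": "archive"},  # repack at higher compression, keep readable
    {"after_days": 730, "action": "delete"},
]

def action_for(age_days):
    """Return the most aggressive action whose age threshold has passed."""
    due = [rule["action"] for rule in LIFECYCLE if age_days >= rule["after_days"]]
    return due[-1] if due else "keep"

print(action_for(3))    # keep
print(action_for(30))   # compact
print(action_for(120))  # archive
print(action_for(800))  # delete
```

A middleware layer enforcing such a policy would answer George’s “when do you…” questions once, centrally, instead of leaving them to each application.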
This “data as a service” offering would orchestrate the access and flow of data to different applications sitting on Hadoop, he says. “At scale we want to land the data once and bring the application to the data and not move any of those bits, which means we need to have a way that says, you have 10 applications, and they share these five data sets over here, and this application can produce some data that this application over here can consume.”
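The “land the data once” model George describes amounts to a shared registry: datasets are declared in one place, applications register as producers or consumers, and the bits themselves never move. A hypothetical sketch (all names invented):

```python
# Hypothetical data-as-a-service registry: who lands which dataset,
# and which applications come to it. Not an existing product's API.
class DataCatalog:
    def __init__(self):
        self.producers = {}  # dataset -> app that lands it
        self.consumers = {}  # dataset -> set of apps reading it

    def register(self, dataset, producer):
        self.producers[dataset] = producer
        self.consumers.setdefault(dataset, set())

    def subscribe(self, app, dataset):
        # The application is brought to the data; no bits are copied.
        self.consumers[dataset].add(app)

    def shared_by(self, dataset):
        return sorted(self.consumers[dataset])

cat = DataCatalog()
cat.register("clickstream", producer="ingest-app")
cat.subscribe("fraud-detection", "clickstream")
cat.subscribe("recommender", "clickstream")
print(cat.shared_by("clickstream"))  # ['fraud-detection', 'recommender']
```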
This product or service would not only make life easier for Hadoop users, but it would simplify things for third party application vendors too.
“If we had a prescribed way of organizing data in Hadoop, then the tool vendors would have it much easier to use their existing tools to plug into that and say ‘Yes I know how this works, I can read the schema of the structure the customer has chosen from some sort of endpoint. From there I can understand what the data sources are, I can suck up the metadata that’s already available.’ We have metadata management coming into Hadoop slowly. But it will also help these sort of vendors to use their existing tools and make them work with the platform. But the platform itself doesn’t provide any help to them.”
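If Hadoop did prescribe a standard metadata endpoint of the kind George imagines, a tool vendor’s integration could reduce to fetching the schema and mapping it onto its existing connectors. The payload shape below is invented purely for illustration:

```python
# Hypothetical standardized metadata payload a vendor tool might fetch
# from an endpoint. The field names and structure are assumptions.
import json

METADATA = json.dumps({
    "dataset": "transactions",
    "location": "/data/landing/transactions",
    "format": "parquet",
    "schema": [
        {"name": "txn_id", "type": "string"},
        {"name": "amount", "type": "decimal(10,2)"},
        {"name": "ts",     "type": "timestamp"},
    ],
})

def columns(payload):
    """A vendor tool only needs the field list to map its own types."""
    return [field["name"] for field in json.loads(payload)["schema"]]

print(columns(METADATA))  # ['txn_id', 'amount', 'ts']
```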
Hadoop needs this to survive in the long run, he says. “If I had a wish, I’d start on this today, but I’m busy with our customers,” George says. “If we don’t have this, then I think it will hinder us in getting more people using Hadoop.”