September 13, 2018

Cloud Looms Large at Strata, and So Does Kubernetes


While there was heavy cloud cover in New York this week, cloud computing could be seen lifting Strata Data Conference, where vendors scrambled to ensure their products can support cloud and on-premise environments simultaneously, which necessitated a steady diet of Special K.

If a Strata attendee had a nickel for every time she heard that a particular big data vendor supports cloud, multi-cloud, or, better yet, hybrid on-premise and cloud deployments, she might have enough to buy a grande mocha frappuccino at one of the many Starbucks locations that dot the Javits Center footprint in New York City.

Whereas Hadoop was once the connecting glue holding this merry band of big data hackers together, now it appears that technological baton has passed to Kubernetes, the Google-developed orchestration layer that lets users move workloads to and fro effortlessly, with just a few clicks of a button.

Cloudera itself is in the middle of that journey, according to co-founder and Chief Strategy Officer Mike Olson, who said the company is working with the Apache Hadoop project and related communities to support Kubernetes with the range of products that compose the Enterprise Data Hub.

According to Olson, the work on Kubernetes is ongoing, and it appears there will be a ready-to-use product in the 12-to-24-month timeframe.

“We have been waiting for private cloud infrastructure to emerge for a long time,” Olson tells Datanami during a briefing at Strata on Wednesday. “Containerization via Docker and the maturation of Kubernetes for orchestration gives us a way to do a whole bunch of what YARN was doing. I think you’ll see that move, accelerate. Everybody has accepted those technologies. The open source communities, individuals on the projects, are beginning to integrate with them in a serious way. So I do think there’s a good private cloud story in the future.”

Clouds are definitely rising in importance, but that doesn’t mean that on-premise Hadoop clusters go away. By all accounts, usage of Hadoop (which is currently defined as X86 clusters that use HDFS and YARN) is still growing. And several vendors mentioned this week that they have many Global 2000 customers that tell them they will never get rid of their data centers.

But the nature of Hadoop itself is changing, and even Olson says one could have “Hadoop” without YARN or HDFS. Kubernetes is the odds-on favorite to replace YARN in many situations, while HDFS will likely be displaced by object stores (although Olson doesn’t think on-prem object stores are ready for prime-time).

“YARN is good at long-running batch job scheduling,” Olson says. “But as a general purpose resource management framework for clusters, it was never well-designed. It may never go away because you probably want to do long-running Spark job allocations. But it won’t be the resource management framework for the rich ecosystem in the long term, especially with private cloud. Kubernetes is going to step in and take over a whole bunch of that. I’m actually really bullish on that.”

In the future, Cloudera expects to be able to give customers the capability to spin up big data workloads — data warehousing and machine learning primarily — on the customers’ choice of infrastructure, including on-premise or a public cloud. What’s more, thanks to Docker containerization and Kubernetes orchestration, they will be able to move those workloads to any other supported environments, be it on-prem or cloud-based, in a fairly simple manner.

Hortonworks, which competes with Cloudera in the big data platform space, also this week unveiled plans to support Kubernetes as part of its Open Hybrid Architecture initiative. The plan, which it hatched with IBM and Red Hat, includes three phases: containerization of its Hadoop and streaming data products; support for object stores via separation of compute and storage in Hadoop; and finally support for OpenShift, Red Hat’s flavor of Kubernetes, which will deliver ultimate portability.


“Just as we enabled the modern data architecture with HDP and YARN back in the day, we’re at it again,” says Hortonworks co-founder and CTO Arun Murthy in a blog post, “but this time it’s bringing the innovation we’ve done in the cloud down to our products in the data center.”

While Kubernetes won’t be supported by Hortonworks or Cloudera for at least a year, there are some forward-looking companies that are already playing around with Kubernetes in their Hadoop clusters.

We would be remiss if we didn’t mention MapR Technologies, which delivered Kubernetes support via a new volume driver for its big data platform in March. The vendor, which uses open source Hadoop tech but also includes a number of proprietary creations inside its data platform, also supports Docker and has an S3-compatible storage API (it also supports NFS and POSIX APIs).

The march to Kubernetes is also occurring within the larger big data vendor community, from the data catalog firms like Alation to the data pipeliners like Talend, and so many others in between.

“It’s a heavy lift,” says Trifacta CEO Adam Wilson, on the difficulty of supporting on-prem Hadoop and the three major cloud providers: Google Cloud Platform, Amazon Web Services, and Microsoft Azure. “In the majority of our customer base there are certainly cloud projects being spun up and we’re involved in more cloud today than ever before.”

Even though Kubernetes is not yet widely used in the big data community, companies are starting to play around with it. “We have a customer that’s standing up a small Kubernetes environment right now,” says Dan Marx, who was named Pepperdata’s vice president of sales last month. “But production-wise, it’s probably not going to be there until the second half of next year.”

While abstraction layers like Kubernetes simplify things for one group of people, they invariably add complexity for others. For administrators in charge of Hadoop clusters, Kubernetes could complicate performance troubleshooting, which is why Pepperdata sees the potential for it to increase demand for its performance-management software.

“That at the end of the day is what drives the roadmap,” Marx says. “We’ve been hearing Docker. We’re absolutely hearing Kubernetes is going to be the next generation of what they’re going to be looking at.”

BlueData, which develops software that lets big data applications run in containers, announced this week that it is supporting its software on Google Cloud and Azure. “Many of our customers have embarked on a multi-cloud or hybrid cloud strategy for their AI, ML, and analytics initiatives,” said Kumar Sreekanti, co-founder and CEO of BlueData. “Our mission is to help these enterprises accelerate their digital transformation journeys, while making the underlying infrastructure invisible with containerization and automation.”

Striim, which develops streaming data integration and analytics solutions, announced that it’s supporting Google Cloud. Several other vendors told Datanami they expect to be supporting the three major cloud providers in the near future.
