Mike Olson on Zoo Animals, Object Stores, and the Future of Cloudera
During last week’s Strata Data Conference, Datanami sat down with Cloudera Chief Strategy Officer Mike Olson to talk about the state of big data and where Cloudera’s offerings are headed next. Here’s a recap of that conversation.
When Olson co-founded Cloudera in 2008 and served as its first CEO, Apache Hadoop was just starting to register as a blip on the radar screens of tech executives, who were wondering how they were going to store all that data and what they could do with it. Now, 10 years later, Olson is still a driving force at Cloudera, but the company’s strategy has changed significantly.
For the first phase of the business, Cloudera wrangled zoo animals, Olson says. “There were Pigs and Hives and Zookeepers and the yellow elephant, and we wrangled them all,” he says. “For about four years we were super evangelical in explaining what big data was and why all those open source projects matter.”
In the 2013 time frame, as the big data technology boom got bigger, Cloudera introduced the enterprise data hub (EDH). “It was a reformulation of the earlier platform story,” Olson says. “All of those projects were in the box, but we didn’t call them out separately any longer because it was the comprehensive capabilities, knit together by security, governance, compliance — all the shared data experience services we’ve rolled out.”
Cloudera only recently departed from the EDH with a new strategy that sharpens the company’s focus on specific areas, including data warehousing, machine learning, and cloud computing. The company has assigned general managers to those areas: Anupam Singh for analytics, Hilary Mason for machine learning, and Vikram Makhija for cloud, while Fred Koopmans heads up the enterprise platform.
Data warehousing and analytics is the farthest along of the three business areas, Olson says. “The major driver of growth on our platform right now are analytics databases, and especially data warehousing workloads,” he says. “Maybe I’m a Netezza customer and I have a few hundred terabytes in my cluster, but IoT is happening to me and I’d like to be able to keep a decade’s worth of data and not just a year or a quarter’s worth of data. So the data volumes are going from tens or hundreds of terabytes to petabytes, and the modern platform gets to handle those and the legacy architectures really can’t.”
That “modern platform” is an amalgamation of various open source projects collectively referred to as Hadoop, which is still important to Cloudera, Olson says. “So we’re continuing to innovate there,” he says. “But that’s really just table stakes. What’s interesting is what you do with the data once you get it into the data lake. That’s our focus now.”
Apache Impala is one of the fastest growing engines in the Hadoop ecosystem at the moment, according to Qubole’s recent survey, where it placed second only to Apache Flink, which grew more than 100%. Cloudera is keen to ride Impala’s blazing interactive query speed — as well as Kudu’s advantages in fast IoT data ingest — to fortunes in the massive data warehousing market.
“If you think about it, right now, there’s $16 billion per year spent on data warehousing,” he says. “That’s a rich field to plow, my friend. The machine learning installed base is vanishingly small but gonna absolutely explode, and it’s possible for a new vendor in the market to go to dramatic market dominance very quickly. So it’s a new business for us but it’s one we’re super, super bullish on.”
Olson says he spends much of his time these days ensuring that Cloudera’s solutions are working well with the Big Three of cloud vendors. “We’re exploding into cloud, so I’m spending a substantial part of my day working with Amazon, working with Microsoft, working with Google to be sure that we run well and that we have the go-to-market alignment with those guys and that we’re integrating with the right level of services.”
Cloudera has already done the work to ensure that its platform works with cloud technologies, specifically by supporting the object storage systems that underlie each of the public cloud platforms. One of the next big steps will be to support hybrid software delivery methods, where customers have the freedom to deploy the same application or applications to public clouds, on-premises clusters, or a combination of both. Kubernetes and Docker are seen as a big part of the answer to that question.
“The storage layer essentially becomes the data lake and you on-demand provision dedicated compute clusters just for the job you’re running, so that’s been really liberating in the cloud,” he says. “That was an impossibly hard thing before Docker and Kubernetes came around… Docker was a super successful project, but until there was good orchestration, until Google made Kubernetes open source, there was no good answer there. And now I think we have the building blocks we need.”
But the other part of that story has to do with the object storage system, and the answers aren’t as clear there, Olson says.
“We’ve already done a big chunk of what we needed to do, which is when we moved to the cloud and embraced the object stores in the cloud, we separated compute and storage,” he says. “So we did that one time already. So we got a bunch of that plumbing. We need to move it on prem now, and that’s a little bit tricky, but Kubernetes and Docker make that easier.
“What we need is storage virtualization on prem,” he continues. “There is no widely adopted, reliable object store in the data center. That would make this a lot easier. You can, by the way, create a shared HDFS cluster, where you’ve got traditional HDFS and then you can fire up and fire down compute clusters on top of that, and that’s likely an interim step that we’ll take while the market for object stores shakes out on prem.”
The object store market is quite diverse, with 21 different vendors, all of which have “minuscule market share,” Olson says. While many (if not most) of them are API-compatible with Amazon’s Simple Storage Service (S3) object store, and everybody agrees that S3 is the right path forward, that doesn’t necessarily make the job any easier, he says.
“S3 likely will be what everybody adopts, but somebody needs to be the variant of S3 on-prem that takes over the market so we can just go to that one,” he says. “We don’t expect to be a general purpose object store vendor. Cloudera won’t go into that business. But we’d very much like to see a market leader emerge and the sooner that happens, the better we think for everybody.”
While Cloudera and its employees who sit on Apache Software Foundation projects were instrumental in the development of the Hadoop Distributed File System (HDFS), that work was a matter of necessity. Today’s Cloudera does not want to be in the business of creating a next-gen object storage system to support the hybrid big data apps of the future, since it sees its future in developing apps that sit higher in the stack, like data warehousing and machine learning.
“You would not believe how expensive it is to support a new object storage system,” Olson says. “It touches every single component; all of the security stuff that we do has to be integrated with whatever the object store surfaces. Every single analytics or data processing engine needs to code to those APIs. And all the vendors will say, yeah, we’re S3-compatible. Dude, do you think so? You are not bug-compatible with S3. And your performance under duress is going to be different. And then you’ve got to design around the needs of those performance curves.”
Olson likes what he sees in Red Hat and the folks behind Ceph, whom he called “lights out good.” But until that common object store of the future arrives, Cloudera will just have to wait for a complete hybrid solution, just like everybody else.