How Cloudera Is Battling Shadow IT with CDP
“Come use our analytics service in the cloud, but don’t tell your IT department. It’ll be our little secret.” The siren call of shadow IT has tempted many business units with promises of fast results. But it’s also exacerbated the data management, security, and governance problems. With the new cloud architecture underlying the Cloudera Data Platform (CDP), Cloudera is hoping to give business units the data analytics agility they crave, but without giving up centralized control.
Cloudera formally launched CDP last week during the Cloudera Analyst Conference, which took place a few blocks away from the Javits Center in New York City, where Cloudera would co-host the Strata Data Conference with O’Reilly Media later in the week. For analysts and journalists covering big data, Cloudera AC was a chance to get up close to the folks pulling the levers and driving the new company’s strategy.
These are critical days for Cloudera, which has had a tough go of it ever since the disastrous second quarter results that led to the ousters of CEO Tom Reilly and Mike Olson, who co-founded the company 11 years ago and was its chief strategy officer. The crux of the problem was this: the on-prem Hadoop business has been usurped by the cloud Hadoop business, as manifested by Amazon Elastic MapReduce, Microsoft Azure HDInsight, and Google Cloud Dataproc.
Cloudera needed to recalibrate its bearings and adjust to the new world order – and quickly. In product terms, it meant nothing short of a complete architectural overhaul, replacing the guts of its Hadoop distribution, namely HDFS and YARN, with cloud-friendly equivalents, namely an S3-compatible object store and Kubernetes. And it needed to do so in a way that provided some backwards compatibility with the products it was replacing, the Cloudera Distribution of Hadoop (CDH) and Hortonworks distribution, HDP.
With last week’s launch of CDP on AWS, the company appears to have completed the first step in its journey to a new cloud architecture for Hadoop (or whatever you want to call it). The company so far only has the data warehouse and machine learning SKUs running atop the new S3 and Kubernetes underpinnings on AWS, as well as a full stack app running on YARN on EC2, called Data Hub.
But the company is just months away from delivering an on-prem version called CDP Data Center Edition, which supports the S3-compatible object store and Red Hat’s Kubernetes distribution. It’s also ramping up the same software as a service (SaaS) kit on Azure and GCP for 2020. And work continues on additional SKUs, for streaming data, data engineering, and operational databases, all of which should be available in your choice of cloud, on-prem, or hybrid deployment models.
Perhaps more importantly, Cloudera has delivered two add-on components that get right to the heart of its war on shadow IT. Those include:
- Shared Data Experience (SDX), which provides security, governance, and lineage to data stored across all Cloudera solutions, including existing on-prem CDH and HDP clusters, as well as hybrid deployments that combine both;
- And Control Plane, which functions as a “single pane of glass” for administrators to spin up and spin down clusters in the cloud, on-prem, and hybrid scenarios.
After all the company has been through – the merger with Hortonworks, the slowdown in sales, the departure of key leaders, the loss of billions in market capitalization, and the ongoing death knells of the cuddly yellow elephant named Hadoop – the execs on-stage at Cloudera AC seemed relieved to have gotten this far.
“It feels like we’ve been in a cave for nine months and now I finally get a chance to meet with people outside my engineering team and kind of show off what we’re doing,” said Arun Murthy, Cloudera’s chief product officer, the Hortonworks co-founder, and the man tasked with guiding development of the new CDP.
Casting Cloud Shadows
Enterprise IT managers are between a rock and a hard place when it comes to satisfying the big data processing and analytic needs of different lines of business (LOB) constituencies, said Fred Koopmans, the vice president of product management at Cloudera.
“IT users have their own objectives that they’re solving for and this is where the tension arises,” Koopmans said last week at Cloudera AC. “It’s all of these individual personas. It’s the infrastructure manager who’s trying to supply guaranteed capacity for the data engineer and on-demand capacity for the analysts. It’s the infrastructure manager supplying the latest version of everything to the data scientist, while never giving downtime and disruption to application developers. That’s where the friction starts to emerge.”
CDH and HDP currently are the only on-prem data platforms capable of handling these different demands while providing centralized data management, security, and governance, he said. But there are drawbacks.
“Everyone on that platform has to run one version of the software. They all have to agree to a common schedule for when to do disruptions,” Koopmans said. “And they all have to deal with a finite amount of compute capacity with relatively static controls of that capacity. And that makes it very difficult for line of business to get everything” they want.
As a result, the LOB goes off-book and seeks other solutions. This is where the shadow IT providers enter the picture.
“There are a number of shadow IT vendors out there whispering in their ear ‘Come with us, and we’ll help you out,’” Koopmans said. “‘Come to us. We’ll give you your own custom environment and you won’t have to work with IT anymore.’”
While the shadow IT providers may ostensibly deliver on their promise to get the LOB their own custom environment spun up quickly and easily, there are costs to this approach that the LOB may not initially be aware of.
“These are just point solutions, and with point solutions, you create a bunch of problems that the centralized platform solved in the first place,” Koopmans said. “How do you maintain a single source of truth when your data is de-centralized and distributed across a lot of these platforms? How do you build that enterprise fraud detection use case that we talked about before when all of those platforms are not speaking to each other, when there’s no inter-connectivity? How do you maintain your regulatory compliance requirement? How do you build whatever’s next for the next applications, when all you have is point solution and they’re worried that you’re going to be deploying yet another point solution for that sort of thing?”
Koopmans is right, of course. A distributed data and IT workloads environment is much tougher to manage, secure, and govern than a centralized one. The challenge for Cloudera, however, is whether it can successfully deliver a platform that gives LOB what it wants – fast access to data and compute resources, and the ability to spin up and spin down workloads as needed – without giving up the benefits of centralized data management, security, and governance.
Cloudera claims that it has delivered that in the guise of CDP, which the company claims simultaneously gives customers cloud flexibility and central control. “What the answer needs to look like, it needs to run the best of both….and that is exactly what CDP does,” Koopmans said. “It is a platform that gives the customer a compute environment to all the different personas and all the different teams, but it does that without removing the single source of truth.”
No Accounting for Workloads
The customer experience with second-generation Hadoop solutions wasn’t great, admits Anupam Singh, Cloudera’s chief customer officer. “Business wants something and it takes two to four months to build a big data cluster,” he said during the Cloudera Analyst Conference. “And that’s just too long. Business wants to do it in weeks.”
Impatient for results, customers have adopted the aforementioned cloud-based Hadoop offerings, as well as Snowflake and Databricks, both of which Cloudera has painted targets on. Many of Cloudera’s customers have more than 100 PB stored in their clusters, and consume tens of thousands of hours of compute time on multi-thousand-node clusters crunching data, Singh said. These are expensive workloads on-prem, especially for banks, but the costs are higher in the cloud.
When one business unit moves big data workloads to the cloud, others are sure to follow, and the impacts grow in a multiplicative manner, Singh says. “Some of the regulated workloads cost our banking customers hundreds of millions of dollars, and when they move it to the cloud, they don’t know what the cost will be,” he said. “On the cloud, it doesn’t seem very expensive. At the end of the year, when the bills come, it’s a whole different level of angst.”
While the bill from the cloud provider is clear, what is less clear is who accessed what. According to Singh, customers have been surprised to learn that when they move to cloud, they no longer have a usage history to analyze how their data was used.
“When this shadow IT happens, when you work with transient computing, all usage history is destroyed,” he said. “When I say this, people think ‘It’s just tables and queries. How bad can it be?’”
One of Cloudera’s customers apparently was no longer able to ascertain how its users were interacting with a 400,000-table database after moving to the cloud. “Imagine setting up the security policies for these tables,” he said. “If you don’t have usage history, you don’t know who accessed these tables or what did they do with it.”
Saying Yes to LOB
Cloudera has its work cut out for it. The genie of shadow IT is out of the bottle, and nobody – much less Cloudera – is going to be able to put it back in. Cloud-based SaaS is a fact of life, and the best we can do is find ways to cope.
A recent study by McAfee found that 76% of companies claimed to use multiple infrastructure as a service (IaaS) providers. However, McAfee’s own research shows it’s actually closer to 92%, a figure that is up nearly 20% in a year. “Security incidents are almost guaranteed to go under the radar if companies don’t even know where all of their infrastructure lives,” the company warned.
As companies become aware of the potential damage that big data can do, it’s worth noting that Cloudera is offering at least part of a solution. Mick Hollison, who is Cloudera’s chief marketing officer and its chief strategy officer, put it in plain words for the Cloudera AC audience.
“Without talking metrics and numbers anymore, when I think about what drives all this, it’s that IT has a mandate to mitigate risk and ensure that security and privacy concerns are dealt with as it relates to the business,” Hollison said. “The business has a mandate to go fast. The business wants to move quickly. They want to be agile. They want to grow their business. And they just want to do things, whenever they want to do them, wherever they want to do them, and security and risk mitigation are pushed aside. And that causes a real chasm between enterprise IT and the business.”
“We hear over and over again inside our executive briefing center that really what it’s created is a culture of enterprise IT having to tell the business, ‘No, I can’t do that as fast as you’d like to do it. No, I can’t stand up that kind of cloud environment by tomorrow.’ No, IT doesn’t want to be in that business. They want to be in the business of saying yes. And really that’s the premise of everything you’re going to hear today [about what] we built with CDP. It’s all built around the idea of getting to say yes.”