October 16, 2019

Simplifying the Big Data Lake Experiences in the Cloud


The cloud is a hot spot for big data lakes these days, thanks largely to the greater technological simplicity and lower upfront costs of getting started in the public cloud. But as organizations grow their cloud data lakes and use cases, the cost and complexity start to build. That’s why some organizations are turning to software as a service (SaaS) providers like Qubole and Cazena to insulate them from the underlying cloud platform.

The market for on-premise Hadoop software has stalled in recent months, and the public cloud providers have arguably been the biggest beneficiaries. Amazon Web Services, Microsoft Azure, and Google Cloud Platform have taken the same open source projects that drove the growth of Cloudera, Hortonworks, and MapR Technologies, but connected them to their own cloud storage and processing paradigms (which is where they make their money).

Customers can run on “bare metal” on EC2 and similar infrastructure as a service (IaaS) platforms, or they can go “up stack” and use the cloud vendors’ Hadoop stacks in a platform as a service (PaaS) manner. These offerings, namely AWS’s Elastic MapReduce (EMR), Microsoft Azure HDInsight, and GCP’s Cloud Dataproc, are gaining lots of customers, thanks to the frustration that on-prem Hadoop customers have felt.

While running Hadoop in the cloud in an IaaS or PaaS manner eliminates the need for customers to buy expensive servers and hire infrastructure teams to run them, it doesn’t get organizations completely out of the woods. They still need experts to manage the software and all the other requirements that go with it: experts in security, performance, access, etc.

That’s helping to drive the market for SaaS solutions, says Hannah Smalltree, the vice president of marketing for Cazena, which develops a SaaS big data lake in the cloud. According to Smalltree, building and managing a data lake in the cloud is still too difficult for the average customer.

“The idea with Cazena is we’re bundling together that infrastructure – the data and analytics engines, security, networking, everything — and delivering it as a fully formed, ready for data ingest, SaaS data lake,” she tells Datanami. “What’s different about our data lakes is, because they’re SaaS, people don’t need to have a full team to support. They typically have somebody who manages users and permissions, and then a data engineer helps with the data loading, data governance things.”

Cazena says its SaaS approach to data lakes saves customers large amounts of time and money

Cazena currently runs atop AWS and Azure, and partners with Cloudera to help customers deploy the Cloudera Distribution of Hadoop (CDH) software into the clouds (it’s working on supporting the vendor’s new Cloudera Data Platform [CDP]). Cazena manages the Hadoop distribution as an end-to-end platform, providing a pure SaaS experience for customers.

“We do the whole thing from ingestion to storage to cloud accounts to actually picking engines,” says Prat Moghe, Cazena’s CEO and founder. “Basically you don’t need people to run and manage these data lakes because you require special skills in terms of security, in terms of ongoing health management.”

Cazena is teaming up with other vendors to give its customers more cloud options and, it hopes, simplify their cloud journeys. It has formed partnerships with DataRobot, StreamSets, Arcadia Data (acquired by Cloudera), and Accelerite, and is looking for more.

“We can essentially team up with other PaaS vendors and make it consumable by the mass market,” says Moghe, who was the SVP of products, strategy, and marketing for Netezza before it was bought by IBM. “Early adopters that are on the cloud have skills. But the majority of the market that’s behind the followers who are just coming on don’t. If we can make it easy, I think it’s a win-win.”

Qubole’s Cloud Growth

Another firm treading the cloud waters is Qubole. Co-founded by former Facebook engineers Ashish Thusoo and Joydeep Sen Sarma (also the co-creators of Apache Hive), Qubole has quietly built itself into a force to be reckoned with in the big data as a service market.

Part of Qubole’s advantage is experience. Thusoo and Sen Sarma founded the company in 2011, which was just two years after Amazon introduced EMR. Being in the business for so long has allowed Qubole to develop cloud-native approaches for offering big data processing on the cloud, as well as solid approaches for managing customers’ underlying cloud compute resources. It has also attracted marquee customers like Lyft, Zillow, Nextdoor, Gannett, and Warner Music Group.

Qubole supports a handful of the most popular big data compute engines natively on four public clouds

With support for all three major public clouds plus Oracle Cloud, Qubole’s mantra is all about giving customers the freedom of choice, says Utpal Bhatt, SVP of marketing for the Santa Clara, California-based company.

“If you’re in the world of analytics and machine learning, there are several different choices you have from an engine perspective,” Bhatt says. “There’s Spark, TensorFlow, Hive, Presto, etc. Our view is that there isn’t a single engine that will meet all your use cases. So we embrace all of them. Our cloud native approach is we take the open source and we re-engineer it, essentially optimizing it, for performance and efficiency.”

Qubole shields the users from the underlying complexity of dealing with these different engines, which are best-of-breed in the big data world, according to Jose Villacis, senior director of product marketing for Qubole.

“Qubole has been able to abstract users and administrators from having to do this grunt work of getting these engines up and running,” Villacis says. “The simplicity of getting up and running in just a few minutes is really what makes the big difference between something that is nascent versus something that has already gone through the process of fine-tuning.”

With its new cloud-first mantra and the launch of CDP on AWS this year (to be followed by Azure and GCP), Cloudera is adapting to the new marketplace realities. With CDP on the cloud, Cloudera is partnering and competing with the public cloud vendors, not to mention Databricks and Snowflake.

Cloudera is competing with Qubole for big data as a service business, even as the two partner on Hive’s support for ACID transactional guarantees. With 200-plus customers around the world on four public cloud platforms, Qubole is taking a wait-and-see approach to CDP.

“From a Cloudera perspective, the proof is going to be in the pudding,” Bhatt says. “The simple auto-scaling has been solved, but things like workload-aware auto-scaling, which is what we do and which inspects every job and the requirements of every job — I think that level of sophistication can take a long time.”

The cloud offers benefits for organizations seeking to collect and process big data. But being on the cloud doesn’t necessarily mean you have to use the public cloud vendors’ offerings.
