November 2, 2023

Cloudera Makes a Move in GenAI with Pinecone Partnership


Cloudera customers have been working with large language models (LLMs) and building generative AI applications for some time. Today, the cloud data management vendor unveiled a partnership with vector database leader Pinecone that’s aimed at accelerating that GenAI work and putting its own stamp on the emerging market under new CEO Charles Sansbury. The company also unveiled results of a GenAI study.

Pinecone is one of the more established providers of vector databases, which have become one of the hottest sectors of the database market since ChatGPT burst onto the scene nearly a year ago, triggering a tsunami of GenAI activity.

As part of the partnership, the two vendors have worked to integrate Pinecone’s vector database into the Cloudera Data Platform (CDP) with the ultimate goal of making it easier for CDP customers to build GenAI applications. While customers must purchase CDP and Pinecone separately, the integration is delivered by Cloudera via something called an Applied Machine Learning Prototype, or AMP.

The Pinecone AMP, when combined with other GenAI necessities that customers have already installed on CDP (such as an LLM from Hugging Face, Meta AI, Anthropic, or Cohere, and a data pipeline powered by Apache NiFi), helps users develop and deploy GenAI applications directly on CDP, says Abhas Ricky, Cloudera’s chief strategy officer.

“So what [the AMP] does is it allows developers to quickly create and augment new knowledgebases from data on their website, as well as some pre-built connectors that will enable you as a customer to quickly set up ingest pipelines for all AI applications,” Abhas tells Datanami. “So in this specific instance, the AMP and the Pinecone vector database use the knowledgebases, and then you can imbue the context into the chatbot responses, basically ensuring that you can get useful outputs, so the fidelity of the outputs becomes much higher.”

In addition to lowering hallucination rates by tapping into the “enterprise context” that exists in customers’ data, the integration will help drive better performance and lower cost, Abhas says. Those are some of the overall goals that Cloudera has set for itself as it tries to deliver GenAI capabilities to its Global 2000 customers.
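The pattern Ricky describes, retrieving relevant knowledgebase entries from a vector index and placing them in the chatbot prompt, can be illustrated with a minimal, self-contained sketch. This is not the Cloudera AMP or the Pinecone API: the names `embed`, `ToyVectorIndex`, and `build_prompt` are hypothetical, the sample documents are made up, and the word-count "embedding" is a crude stand-in for a real embedding model.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: lowercase word counts (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class ToyVectorIndex:
    """Hypothetical in-memory index; a real deployment would use a vector database."""

    def __init__(self):
        self._items = []

    def add(self, doc_id, text):
        # Store the text alongside its vector so it can be returned as context.
        self._items.append((doc_id, text, embed(text)))

    def query(self, question, top_k=2):
        # Score every stored document against the question, best match first.
        q = embed(question)
        scored = [(cosine(q, vec), doc_id, text) for doc_id, text, vec in self._items]
        scored.sort(key=lambda item: item[0], reverse=True)
        return scored[:top_k]


def build_prompt(question, index, top_k=2):
    """Prepend the retrieved knowledgebase entries to the user's question."""
    hits = index.query(question, top_k)
    context = "\n".join(f"- {text}" for _, _, text in hits)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


# Made-up knowledgebase entries for illustration only.
index = ToyVectorIndex()
index.add("kb-1", "Support tickets are answered within one business day.")
index.add("kb-2", "The cafeteria serves lunch from noon until two.")
index.add("kb-3", "Premium support customers get answers within four hours.")

prompt = build_prompt("How fast are support tickets answered?", index, top_k=1)
print(prompt)
```

The resulting prompt would then be sent to whichever LLM the application uses. Grounding the model in retrieved text this way is the general mechanism by which enterprise context can reduce hallucinations.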

There are three things that customers want for GenAI applications, the Cloudera CSO says. “Number one is enterprise context, because everyone wants to develop their own GPT trained on their enterprise context,” he says.

The second is trust. “Everyone wants to be able to trust the data they’re going to use to train their models,” he says, “and therefore they’re coming to us and saying that, hey, we want to work with you for the governance features and the metadata authorization and the audit capabilities.”

Lastly, CDP customers want Cloudera to help them bolster performance. “People are coming to us for compute,” Abhas says. “We are also partnering with hardware providers out there for hardware acceleration. There is a customer who told us ‘We run generative AI use cases on GPUs on private cloud and that have saved us 30% to 35% on TCO.’ And that’s a massive reduction because they’re spending tens of millions of dollars a month on that.”


Cloudera, which is holding its Evolve New York conference this week in part to introduce new CEO Sansbury, is establishing partnerships with other vendors to help drive its GenAI strategy. That includes AWS and the vector database capabilities in Amazon Bedrock, and it may establish partnerships with other vector database providers in the future, Abhas says.

The former Hadoop distributor is also counting on its use of the Apache Iceberg table format as a way to enable its customers to safely interact with data stored on CDP in a number of different ways, from SQL analytics to training and deploying GenAI applications.

“Iceberg is very key to us,” Abhas says. “We’re all in on Iceberg insofar as our open data lakehouse strategy is concerned, because we want to be staying through the open source ethos and we believe that will help us integrate better with partners, but also help joint customers navigate the world which is outside of the walled garden of Cloudera. So that’s a bridging layer for us. We have these pre-built data flow ReadyFlows into the Iceberg tables so you can leverage that.”

The company also released results of a survey of 500 American IT decision makers and data scientists about their companies’ plans for GenAI applications.

The survey found that 53% of respondents are currently using GenAI technology, and an additional 36% are in the early stages of exploring AI for potential implementation in the next year.

However, 84% said they are concerned about sharing data with third parties for training or fine-tuning of GenAI models, according to Cloudera, which characterized the overall GenAI landscape as “a still untamed, Wild West-like environment when it comes to data privacy, security, and compliance.”
