May 14, 2024

New GenAI Models On Tap from Google, OpenAI


OpenAI and Google released major updates to their AI models this week, including OpenAI’s release of GPT-4o, which adds audio interactions to the popular LLM, and Google’s launch of Gemini 1.5 Flash and Project Astra, among other news.

The Internet was rife with speculation late last week that OpenAI was on the cusp of launching a new search service that would rival Google. OpenAI CEO Sam Altman quashed those rumors, but did state Friday that Monday’s product announcement would be “magical.”

It’s not clear if GPT-4o counts as magical yet, but by all accounts, it does represent a solid, if incremental, improvement over GPT-4, the world’s most popular large language model (LLM).

The key deliverable with GPT-4o (the “o” stands for “omni”) is the ability to interact with the LLM verbally, a la services like Apple Siri and Amazon Alexa.

OpenAI aims to enable users to carry on a natural conversation with GPT-4o

According to OpenAI’s May 13 blog post, the new model can respond to audio inputs in as little as 232 milliseconds, with an average of 320 milliseconds. That “is similar to human response time in a conversation,” the company says. It’s also much faster than the “voice mode” that OpenAI previously supported, which had average latencies of 2.8 seconds with GPT-3.5 and 5.4 seconds with GPT-4, too slow for a natural conversation.

GPT-4o is a new model trained end-to-end across text, vision, and audio, making it the first OpenAI model that combines all of these modalities. It matches GPT-4 Turbo performance for understanding and generating English text and code, the company says, “while also being much faster and 50% cheaper in the API.”

Meanwhile, Google also had some GenAI news to share from its annual developer conference, Google I/O. The news centers primarily around Gemini, the company’s flagship multi-modal generative AI model.

First up is Gemini 1.5 Flash, a lightweight version of Gemini 1.5 Pro, which the company launched earlier this year. Gemini 1.5 Pro sports a 1 million token context window, among the largest in the industry. However, concerns over the latencies and costs associated with such a powerful model sent Google back to the drawing board, where it came up with Gemini 1.5 Flash.

Project Astra aims to create “universal AI agents” that perceive the world more like humans

Meanwhile, Google bolstered Gemini 1.5 Pro with a 2 million token context window. It also “enhanced its code generation, logical reasoning and planning, multi-turn conversation, and audio and image understanding through data and algorithmic advances,” writes Demis Hassabis, the CEO of Google DeepMind, in a blog post.

Google also announced the launch of Project Astra, a new endeavor to create “universal AI agents.” Astra, which stands for “advanced seeing and talking responsive agents,” aims to move the ball forward in creating agents that understand and respond to the complex world around them as people do, and that also remember what they have heard and understand the context. In short, the goal is to make artificial agents more human-like.

“While we’ve made incredible progress developing AI systems that can understand multimodal information, getting response time down to something conversational is a difficult engineering challenge,” Hassabis says. “Over the past few years, we’ve been working to improve how our models perceive, reason and converse to make the pace and quality of interaction feel more natural.”

Related Items:

Google Cloud Bolsters AI Options At Next ’24

Has GPT-4 Ignited the Fuse of Artificial General Intelligence?

Google Launches Gemini, Its Largest and Most Capable AI Model