February 12, 2024

The Future of AI Is Hybrid


Artificial intelligence today is largely something that occurs in the cloud, where huge AI models are trained and deployed on massive racks of GPUs. But as AI makes its inevitable migration into the applications and devices that people use every day, it will need to run on smaller compute devices deployed to the edge and connected to the cloud in a hybrid manner.

That’s the prediction of Luis Ceze, the University of Washington computer science professor and OctoAI CEO, who has closely watched the AI space evolve over the past few years. According to Ceze, AI workloads will need to break out of the cloud and run locally if the technology is to have the impact foreseen by many.

In a recent interview with Datanami, Ceze gave several reasons for this shift. For starters, the Great GPU Squeeze is forcing AI practitioners to search for compute wherever they can find it, which is making the edge look downright hospitable today, he says.

“If you think about the potential here, it’s that we’re going to use generative AI models for pretty much every interaction with computers,” Ceze says. “Where are we going to get compute capacity for all of that? There’s not enough GPUs in the cloud, so naturally you want to start making use of edge devices.”

Luis Ceze is the CEO of OctoAI

Enterprise-level GPUs from Nvidia continue to push the bounds of accelerated compute, but edge devices are also seeing big speed-ups in compute capacity, Ceze says. Apple and Android devices are often equipped with GPUs and other AI accelerators, which will provide the compute capacity for local inferencing.

The network latency involved with relying on cloud data centers to power AI experiences is another factor pushing AI toward a hybrid model, Ceze says.

“You can’t make the speed of light faster and you cannot make connectivity be absolutely guaranteed,” he says. “That means that running locally becomes a requirement, if you think about latency, connectivity, and availability.”

Early GenAI adopters often chain multiple models together when developing AI applications, and that is only accelerating. Whether it’s OpenAI’s massive GPT models, Meta’s popular Llama models, Mistral’s open language models, or any of the thousands of other open source models available on Hugging Face, the future is shaping up to be multi-model.

The same type of framework flexibility that enables a single app to utilize multiple AI models also enables a hybrid AI infrastructure that mixes on-prem and cloud models, Ceze says. Where a model runs still matters, but developers will have the option to run it locally or in the cloud.

“People are building with a cocktail of models that talk to each other,” he says. “Rarely it’s just a single model. Some of these models could run locally when they can, when there’s some constraints for things like privacy and security…But when the compute capabilities and the model capabilities that can run on the edge device aren’t sufficient, then you run on the cloud.”

At the University of Washington, Ceze led the team that created Apache TVM (Tensor Virtual Machine), which is an open source machine learning compiler framework that allows AI models to run on different CPUs, GPUs, and other accelerators. That team, now at OctoAI, maintains TVM and uses it to provide cloud portability of its AI service.
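To make that concrete, here is a minimal sketch of the kind of compile-and-run flow TVM enables through its Relay front end: import a trained model, compile it for a hardware target, and execute it with the graph executor. The ONNX file, input name, and tensor shape below are illustrative assumptions; changing the target string is what lets the same model be built for server CPUs, mobile GPUs, or other accelerators.

import numpy as np
import onnx
import tvm
from tvm import relay
from tvm.contrib import graph_executor

# Load a trained model (file name and input shape are assumptions for illustration)
onnx_model = onnx.load("mobilenet_v2.onnx")
shape_dict = {"input": (1, 3, 224, 224)}

# Translate the model into TVM's Relay intermediate representation
mod, params = relay.frontend.from_onnx(onnx_model, shape_dict)

# Compile for a target; "llvm" means the host CPU, but the same flow can
# target mobile GPUs or NPUs by swapping the target string
target = "llvm"
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target=target, params=params)

# Run inference through the compiled module
dev = tvm.device(target, 0)
runtime = graph_executor.GraphModule(lib["default"](dev))
runtime.set_input("input", np.random.rand(1, 3, 224, 224).astype("float32"))
runtime.run()
output = runtime.get_output(0).numpy()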

“We’ve been heavily involved with enabling AI to run on a broad range of devices, and our commercial products evolved to be the OctoAI platform. I’m very proud of what we built there,” Ceze says. “But there’s definitely clear opportunities now for us to enable models to run locally and then connect it to the cloud, and that’s something that we’ve been doing a lot of public research on.”


In addition to TVM, other tools and frameworks are emerging to enable AI models to run on local devices, such as MLC LLM and Google’s MLIR project. According to Ceze, what the industry needs now is a layer to coordinate the models running on-prem and in the cloud.

“The lowest layer of the stack is what we have a history of building, so these are AI compilers, runtime systems, etc.,” he says. “That’s what fundamentally allows you to use the silicon well to run these models. But on top of that, you still need some orchestration layer that figures out when should you call to the cloud? And when you call to the cloud, there’s a whole serving stack.”
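As a rough sketch of what such an orchestration layer might look like, the snippet below routes a request to an on-device model when privacy or model-size constraints allow, and falls back to a cloud endpoint otherwise. The class names, thresholds, and client objects are hypothetical placeholders, not OctoAI APIs.

from dataclasses import dataclass

@dataclass
class Request:
    prompt: str
    privacy_sensitive: bool = False  # e.g., personal data that should stay on the device

class HybridRouter:
    """Decides whether a request runs on the edge device or in the cloud."""

    def __init__(self, local_model, cloud_client, local_param_limit=13e9):
        self.local_model = local_model      # e.g., a quantized 7B/13B model on the phone
        self.cloud_client = cloud_client    # client for a hosted serving stack
        self.local_param_limit = local_param_limit

    def _fits_locally(self, required_params: float) -> bool:
        # Crude capability check: does the needed model fit on the device?
        return required_params <= self.local_param_limit

    def generate(self, req: Request, required_params: float = 7e9) -> str:
        # Privacy-sensitive requests stay on the device; otherwise fall back
        # to the cloud serving stack when local capability is insufficient.
        if req.privacy_sensitive or self._fits_locally(required_params):
            return self.local_model.generate(req.prompt)
        return self.cloud_client.generate(req.prompt)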

The future of AI development will parallel Web development over the past quarter century, where all the processing except HTML rendering started out on the server, but gradually shifted to running on the client device too, Ceze says.

“The very first Web browsers were very dumb. They didn’t run anything. Everything ran on the server side,” he says. “But then as things evolved, more and more of the code started running in the browser itself. Today, if you’re going to run Gmail and run Google Lives in your browser, there’s a gigantic amount of code that gets downloaded and runs on your browser. And a lot of the logic runs in your browser and then you go to the server as needed.”

“I think that’s going to happen in AI, as well with generative AI,” Ceze continues. “It will start with, okay this thing entirely [runs on] massive farms of GPUs in the cloud. But as these innovations occur, like smaller models, our runtime system stack, plus the AI compute capability on phones and better compute in general, allows you to now shift some of that code to running locally.”

Large language models are already running on local devices. OctoAI recently demonstrated Llama2 7B and 13B running on a phone. There’s not enough storage and memory to run some of the larger LLMs on personal devices, but modern smartphones can have 1TB of storage and plenty of AI accelerators to run a variety of models, Ceze says.

That doesn’t mean that everything will run locally. The cloud will always be essential to building and training models, Ceze says. Large-scale inferencing will also remain the province of massive cloud data centers, he says. All the cloud giants are developing their own custom processors to handle this, from AWS with Inferentia and Trainium to Google Cloud’s TPUs to Microsoft Azure’s Maia.

“Some models would run locally and then they would just call out to models in the cloud when they need compute capabilities beyond what the edge device can do, or when they need data that’s not available locally,” he says. “The future is hybrid.”
