March 14, 2024

AWS Delivers ‘Lightning’ Fast LLM Checkpointing for PyTorch

AWS customers who are training large language models (LLMs) will be able to complete their model checkpoints up to 40% faster thanks to improvements AWS has made with its Amazon S3 PyTorch Lightning Connector. The company also made updates to other file services, including Mountpoint, the Elastic File System, and Amazon S3 on Outposts.

The process of checkpointing LLMs has emerged as one of the biggest bottlenecks in developing generative AI applications. While the data sets used in training LLMs are relatively small–on the order of 100GB–the LLMs themselves are quite large, and so are the GPU clusters used to train them.

Training big LLMs on these massive GPU clusters can take months, as the models go over the training data again and again, refining their weights. To protect their work, GenAI developers back up the LLMs, or checkpoint them, on a regular basis.

It’s somewhat like 1980s high-performance computing, said AWS Distinguished Engineer Andy Warfield.

“They have a big distributed system that they’re building the model on, and they have enough hosts that the GPU hosts fail,” Warfield told Datanami. “Either they have bugs in their own software or a service failed. They’re running these things for thousands of servers, potentially months at a time for some of the big LLMs. You don’t want to lose the entire job two weeks in if you fail a GPU.”
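The pattern Warfield describes, saving state at intervals so that a failure costs only the work done since the last save, can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not how AWS or PyTorch implements it; the function names, the toy "weights" list, and the simulated failure are all hypothetical.

```python
import os
import pickle


def save_checkpoint(path, state):
    """Write the state to a temp file, then rename it into place atomically."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp_path, path)  # a crash mid-write never corrupts the old checkpoint


def load_checkpoint(path):
    """Return the last saved state, or a fresh state if no checkpoint exists."""
    if not os.path.exists(path):
        return {"step": 0, "weights": [0.0, 0.0]}
    with open(path, "rb") as f:
        return pickle.load(f)


def train(path, total_steps, checkpoint_every, fail_at=None):
    """Toy training loop that resumes from the last checkpoint on start."""
    state = load_checkpoint(path)
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated GPU host failure")
        # Stand-in for one real optimizer step.
        state["weights"] = [w + 0.1 for w in state["weights"]]
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    return state


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as d:
        ckpt = os.path.join(d, "model.ckpt")
        try:
            train(ckpt, total_steps=10, checkpoint_every=3, fail_at=7)
        except RuntimeError:
            pass  # the job died; only the steps since the last checkpoint are lost
        print(train(ckpt, total_steps=10, checkpoint_every=3)["step"])
```

The trade-off is the checkpoint interval: saving more often loses less work after a failure but pauses training more often, which is why faster checkpoint writes matter.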

The quicker the checkpoint is done, the quicker the customer can get back to training their LLM and developing the GenAI product or service. Warfield and his team of engineers set out to find ways to speed up the checkpointing of these models to Amazon S3, the company’s massive object store.

The speedup was delivered as an update to the Amazon S3 Connector for PyTorch, which AWS launched last fall at re:Invent. The connector provides a very fast method to move data between S3 and PyTorch, the popular framework used to develop AI models, including GenAI models.

Specifically, the Amazon S3 Connector for PyTorch now supports PyTorch Lightning, the faster, easier to use version of the popular machine learning framework. The connector uses AWS’s Common Runtime, or CRT, which is a group of open source, client-side libraries for the REST API that AWS has written in C and which function like a “souped-up SDK,” Warfield told us last fall.

The connector provides lightning-fast data movement, according to Warfield. In fact, it’s so fast that, at first, he had a hard time believing it.

“The team was working on the PyTorch connector and they were benchmarking how quickly they could write checkpoints out to S3,” he explains. “And their baseline for the benchmark was, they were using a GPU instance with instance storage. So they were writing the checkpoints out to local SSD.

“Local SSD is obviously pretty darn fast,” he continued. “So they came back and said ‘Andy, check out our results. We are faster writing checkpoints to S3 than we are writing to the local SSD.’ And I was like, guys, I call BS on this. There’s no way you’re beating the local SSD for these checkpoints!”


After investigating what occurred and rerunning the test, the testers were proven correct. It turns out that moving data to a single SSD, even when it’s connected via the internal PCIe bus, is slower than moving the data over network interface cards (NICs) to S3.

“The punch line was that the SSD is actually PCIe-lane limited,” he said. “There are fewer PCIe lanes to the SSD than there are to the NIC. And so by parallelizing the connections out to S3, S3 was actually higher throughput on the PCIe bus, on the host, than this one local SSD that they were writing to. And so it was kind of a cool result.”
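Warfield's point comes down to simple arithmetic: a device's ceiling on the host is its lane count times the bandwidth of each lane. The sketch below works through that with assumed numbers. The article gives no lane counts or PCIe generation, so the x4 SSD link, the x16 NIC link, and the figure of roughly 2 GB/s per lane (about what PCIe Gen4 delivers) are illustrative assumptions only.

```python
# Assumed, illustrative figures; none of these come from the article.
GB_PER_S_PER_LANE = 2.0  # roughly the usable bandwidth of one PCIe Gen4 lane
SSD_LANES = 4            # hypothetical: a single NVMe SSD on an x4 link
NIC_LANES = 16           # hypothetical: a network card on an x16 link


def link_bandwidth(lanes, per_lane=GB_PER_S_PER_LANE):
    """Upper bound on host-side throughput for a device, in GB/s."""
    if lanes <= 0:
        raise ValueError("lanes must be positive")
    return lanes * per_lane


ssd_ceiling = link_bandwidth(SSD_LANES)
nic_ceiling = link_bandwidth(NIC_LANES)

if __name__ == "__main__":
    print(f"SSD ceiling: {ssd_ceiling:.1f} GB/s")
    print(f"NIC ceiling: {nic_ceiling:.1f} GB/s")
    print(f"NIC / SSD:   {nic_ceiling / ssd_ceiling:.1f}x")
```

Under these assumptions the NIC has four times the headroom of the SSD, but only if the writer opens enough parallel connections to S3 to fill it; a single stream would not get there.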

In other file system news, AWS is boasting a 2x increase in performance for Amazon Elastic File System (Amazon EFS), the multi-tenant file system service that exposes the NFS protocol for POSIX-compliant applications. The service, which AWS launched in 2016, lets users scale up or down as needed.

EFS customers can now expect to read files at speeds up to 20 GB/s and write files to EFS at speeds up to 5 GB/s. The company says that makes EFS more usable for workloads with high-throughput file access requirements, such as machine learning, genomics, and data analytics applications.

“It’s just an example of the continuous work that the teams do on improving performance,” Warfield said. “This is just a bump in the maximum performance that you get out of these systems that we’re pushing through all the time. It just opens up the network.”
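To put the new EFS ceilings of 20 GB/s for reads and 5 GB/s for writes in perspective, here is a back-of-the-envelope calculation of best-case transfer times. The 500 GB checkpoint size is a hypothetical figure chosen for illustration, and real workloads will not always reach the stated maximums.

```python
EFS_READ_GB_PER_S = 20.0   # maximum read throughput cited in the article
EFS_WRITE_GB_PER_S = 5.0   # maximum write throughput cited in the article


def transfer_seconds(size_gb, throughput_gb_per_s):
    """Best-case time to move size_gb at a sustained throughput."""
    if throughput_gb_per_s <= 0:
        raise ValueError("throughput must be positive")
    return size_gb / throughput_gb_per_s


if __name__ == "__main__":
    checkpoint_gb = 500.0  # hypothetical checkpoint size, not from the article
    write_s = transfer_seconds(checkpoint_gb, EFS_WRITE_GB_PER_S)
    read_s = transfer_seconds(checkpoint_gb, EFS_READ_GB_PER_S)
    print(f"write: {write_s:.0f} s, read: {read_s:.0f} s")
```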

EFS can’t yet deliver the data throughput of a system like Amazon FSx for NetApp ONTAP, which the company also improved earlier this month. AWS cranked the performance dial for its ONTAP file service by 2x, giving customers a maximum of 72 GB/s throughput.

The difference between FSx for NetApp ONTAP and EFS, Warfield explained, is that the ONTAP file service runs on dedicated hardware sitting in an AWS data center, whereas EFS is a shared, multi-tenant service. The NetApp team has also been developing their file system for about three decades, while EFS is about 15 years old, he added, but EFS is evolving quickly.

“If you look at the announcements that we’ve made on EFS over the past two years in particular, the cadence of performance and latency and throughput improvements on EFS…it’s moving quite fast.”

Another method AWS customers use to connect S3 to their apps is via the Mountpoint service, another component of the CRT that exposes an HDFS interface to the outside world (for Hadoop MapReduce or Spark jobs) and talks S3 inside AWS data centers.

Today AWS launched a new Mountpoint for Amazon S3 Container Storage Interface (CSI) driver for Bottlerocket, the free and open source version of Linux for hosting containers. The new driver makes it easy for customers running apps in Amazon Elastic Kubernetes Service (Amazon EKS) or self-managed Kubernetes clusters to connect them to S3, without making application code changes.

“Our whole intention with this and this stuff is to just make it as easy as possible to bring whatever tool you want to your data and not have to think about that,” Warfield said.

Finally, AWS also announced the addition of application caching for Amazon S3 on Outposts, the service for customers running AWS hardware on-prem. With this release, AWS has removed the necessity of making a round-trip from the customer’s premises to the AWS data center for every request, thereby reducing network latency.

AWS made these announcements today in honor of the 18th anniversary of the launch of Amazon S3, which happens to be Pi Day. For more info, check out AWS’ Pi Day blog.
