Peak:AIO Cranks the Storage Throughput for Affordable AI Data Serving
Organizations that want to keep their pricey GPUs fed with data for machine learning training purposes but don’t want to break the bank with a big parallel file system installation may be interested in a fast new NFS-based storage offering unveiled today by Peak:AIO, which delivers 80 GB per second of I/O capacity from a 1U server.
Peak:AIO develops server-agnostic data storage systems designed for AI workloads, such as those run on the DGX systems from Nvidia. The British company’s previous iteration of the AI Data Server, which it sells via hardware partners like DellEMC and Supermicro, could deliver 40 GB per second of storage I/O via RDMA atop NFS from a 2U box. With the latest iteration of the AI Data Server, the company has doubled the data I/O while cutting the size of the box in half, to a 1U system.
It’s all about delivering the biggest storage bang for the customers’ buck, according to PEAK:AIO Founder and CEO Mark Klarzynski. “The key to us really is keeping the funds in the bits that give the return on investment to the user, which is the GPUs,” he says.
Klarzynski founded Peak:AIO in 2019 to take on a new segment of the market. As a storage veteran who was instrumental in establishing software-defined storage, Klarzynski has made his mark on the space. He did some of the foundational work with iSCSI, Fibre Channel, and InfiniBand with his earlier startups, including some acquired by Tandberg Data and Fusion-io.
In devising his plan for his latest startup, Peak:AIO, Klarzynski came to realize a large segment of the marketplace was being missed by the big storage vendors. He found the traditional storage vendors were missing the mark when it came to delivering fast and easy-to-use storage for AI training, particularly among startups and smaller firms.
“They were spending a significant amount of money on GPUs that were going to be underused because they couldn’t get the data,” Klarzynski says. “And because I’m very storage-centric, it took quite a while for this to sink in.”
As AI workloads proliferated, a new class of organizations was adopting high-end processing setups, like Nvidia’s DGX systems. A hospital that needs to use computer vision algorithms to detect brain tumors from MRI scans, for example, can justify investing $250,000 in a DGX system. However, when it came to buying the 50TB to 100TB of high-end NVMe storage that the hospital needed to keep that DGX system fed with data, it was looking at an outlay of $600,000 to $700,000.
“So the thing that gave them the value was a third of the cost of the storage that they didn’t actually care about,” Klarzynski says. “They were never going to back it up, because that was being dealt with elsewhere. They were never going to snapshot it. They couldn’t de-dupe it. They just need it to feed the GPU.”
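The "a third" figure can be checked against the numbers quoted above. The following is only a back-of-the-envelope sketch using the article's example prices; the variable and function names are illustrative and do not come from Peak:AIO.

```python
# Example prices quoted in the article (US dollars).
DGX_COST = 250_000
STORAGE_COST_LOW = 600_000
STORAGE_COST_HIGH = 700_000


def gpu_share(gpu_cost: float, storage_cost: float) -> float:
    """Fraction of the combined GPU + storage budget that goes to the GPU system."""
    return gpu_cost / (gpu_cost + storage_cost)


for storage in (STORAGE_COST_LOW, STORAGE_COST_HIGH):
    ratio = DGX_COST / storage
    share = gpu_share(DGX_COST, storage)
    print(
        f"storage ${storage:,}: DGX is {ratio:.0%} of the storage cost, "
        f"{share:.0%} of the combined budget"
    )
```

At either end of the quoted range, the DGX system costs roughly a third to two-fifths of what the storage does, which is consistent with Klarzynski's remark.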
Klarzynski found inspiration from VAST Data. “They came out with a message that said, look, nobody likes parallel file systems. Let’s make NFS that everybody understands go as fast as parallel file systems,” he says. “And it resonated.”
Thus, Peak:AIO was born. Klarzynski found a market that demanded ultra-high-performance NVMe storage atop an NFS file system, but without all the bells and whistles that traditionally accompany the large storage arrays based on parallel file systems.
Like VAST Data, Peak:AIO would stick with NFS, which is easier to manage than a parallel file system. But instead of targeting the enterprise and HPC markets with all the high-end features that those customers demand, Peak:AIO would go after the smaller outfits that just need to keep their GPUs fed from a handful of storage boxes.
The biggest difficulty in developing what would be known as the AI Data Server, Klarzynski says, was making it “Nvidia friendly.” The company adopted the RDMA protocol and standardized on Mellanox adapters to ensure compatibility with how Nvidia wants to connect to data.
“We removed a lot of those features like snapshot, deduplication, replication, that A. weren’t needed and B. added latency within the code, even if they were turned off,” Klarzynski says. “That enabled us to differentiate ourselves a little bit… And we spent an awful lot of work with Nvidia to make sure that we had all that RDMA compatibility.”
With the first iteration of the AI Data Server, PEAK:AIO could handle two 200 Gbps RDMA network cards (CX-6) from Mellanox (owned by Nvidia), which delivered 40 GB per second in total I/O capacity, the company says. With the new iteration of the server, PEAK:AIO supports up to two 400 Gbps CX-7 cards, delivering 80 GB per second in total I/O.
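To put those figures in context, the short sketch below converts the quoted network speeds from gigabits to gigabytes per second and compares them with the I/O numbers the company cites. It assumes 8 bits per byte and ignores protocol overhead, so the "raw" value is only a theoretical ceiling; the names used are illustrative.

```python
BITS_PER_BYTE = 8


def line_rate_gbytes(cards: int, gbps_per_card: int) -> float:
    """Aggregate raw network line rate in GB/s for a number of adapters."""
    return cards * gbps_per_card / BITS_PER_BYTE


# (number of cards, Gbps per card, GB/s quoted by the company)
generations = {
    "previous (2x 200 Gbps CX-6)": (2, 200, 40),
    "new (2x 400 Gbps CX-7)": (2, 400, 80),
}

for name, (cards, gbps, quoted) in generations.items():
    raw = line_rate_gbytes(cards, gbps)
    print(
        f"{name}: raw {raw:.0f} GB/s, quoted {quoted} GB/s "
        f"({quoted / raw:.0%} of line rate)"
    )
```

By this rough measure, both generations are quoted at about 80 percent of the combined line rate of their two network cards.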
PCIe5 is critical to delivering that speedup, Klarzynski says, but it took some clever engineering on the part of PEAK:AIO to make efficient use of all that sheer bandwidth.
“The trick is…normally when we measure bandwidth in the normal world, whether or not that’s HPC, enterprise, or big data, we tend to think of it being driven by multiple users or multiple clients,” he says. “Typically in AI, it’s often only one or two. So while just getting the performance and getting it out [was hard], being able to allow one machine to take it off was actually harder, because most of the standard protocols just don’t work that fast. So we had to put a lot of work into it, meaning that we could not only drive 80 GBs, but we could do it on one or two machines, not 10 or 20.”
So far, the message and the product seem to be resonating. Klarzynski says demand for his AI Data Server has exceeded his earlier expectations. He attributes that to the faster-than-expected adoption of AI, including large language models. Most PEAK:AIO clients require about 50TB to 150TB of storage, though the company gets the occasional order for more than 1PB.
“When you put all those things together, as we did, you get back to that sort of cliched mission statement, which is we made a product that gave them the price, the performance, and the features that they needed or didn’t need, and it just worked,” he says. “And it was fundamentally very simple.”
The new AI Data Server is not GA quite yet, but it is available for testing. Current systems start at $8,000. For more information, see the company’s website at www.peakaio.com.