Groq More Than Doubles Llama-2 70B LLM Inference Performance in Three Weeks
MOUNTAIN VIEW, Calif., Aug. 31, 2023 — Groq, an artificial intelligence (AI) solutions provider, today announced that it has more than doubled its inference performance on the Large Language Model (LLM) Llama-2 70B in just three weeks and is now running at more than 240 tokens per second (T/s) per user on its LPU system. As noted in its previous press release, Groq was the first to achieve 100 T/s per user for Llama-2 70B.
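For context, a quick back-of-the-envelope sketch in Python, using only the two figures quoted in this announcement, shows what the jump from 100 T/s to 240 T/s per user means in practice:

```python
# Illustrative arithmetic only; the 100 and 240 T/s figures
# come from Groq's announcements. Nothing here is Groq code.

prev_tps = 100.0  # tokens/sec per user, Groq's earlier record
curr_tps = 240.0  # tokens/sec per user, the new figure

speedup = curr_tps / prev_tps             # 2.4x in three weeks
latency_ms_per_token = 1000.0 / curr_tps  # ~4.2 ms between tokens

print(f"Speedup: {speedup:.1f}x")
print(f"Per-token latency: {latency_ms_per_token:.1f} ms")
```

At roughly 4 ms per token, text streams back faster than most people read, which is what enables the real-time user experiences described below.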
Now that Groq has broken the performance record twice, delivering language responses at over 240 T/s per user, is there room for still more performance improvement on its first-generation 14nm silicon, fabbed in the US?
Jonathan Ross, CEO and founder of Groq, shared, “Groq broke a record a few weeks ago by being the first to hit 100 tokens per second per user on Llama-2 70B, a record that no one has responded to with competitive performance. Today, we announce 240 T/s per user! It’s becoming unclear if GPUs can keep up with the Groq Language Processing Unit (LPU) system on Large Language Models.”
Jay Zaveri, Social Capital partner, founder of Dropbox-acquired CloudOn, and Groq Board Member, commented, “The ultimate language processing system combines great software, programmability, ease of use, and scalability, wrapped around a best-in-class processor. Groq has quietly been building such a system for the last few years and has superior token throughput, tokens per dollar, and tokens per watt. While others may try to catch up, Groq is well on its way to rolling out its systems to the people and customers who matter: developers who are building the future of AI.”
In private demos, Groq customers are seeing a new world of possibilities, going so far as to say that Groq solutions are making them consider new low-latency LLM use cases for their verticals. For example, LLMs deployed to monitor large volumes of text data from sources such as online forums and social media can help rapidly detect potential cyberattacks or security breaches. Ultra-low latency is essential for real-time analysis and response, playing a pivotal role in safeguarding sensitive information, critical infrastructure, and national security interests.
Additionally, LLMs can be deployed to transform local emergency responses during natural disasters. Using real-time data from social media, emergency calls, or weather reports, the models can identify critical geographic areas needing assistance, predict threats, and provide accurate guidance to first responders and affected communities. Here, ultra-low latency can mean faster delivery of life-saving information, better-prepared disaster management, and increased public trust. With real-time, fluid user experiences built on the most current and valuable data available, LLMs will continue to capture more of the AI market and create impact with real-world applications.
If you’re interested in seeing Llama-2 70B running at 240 T/s per user on Groq, register for GroqDay, a hybrid event on September 7th at 1:00pm PDT. You can attend virtually, or in person if you are local to the Bay Area. If you’d like to schedule an exclusive one-on-one demo, reach out to [email protected].
Groq is an AI solutions company delivering ultra-low latency inference with the first-ever Language Processing Unit. For more information, visit www.groq.com.