November 22, 2022

Stanford Researchers Develop HELM Benchmark for Language Models

Jaime Hampton

Foundation models like GPT-3, BLOOM, and BERT have garnered much attention as of late, for good reason. These versatile models, often trained with a vast amount of unstructured data, have immense capabilities that can be adapted to multiple applications, but their homogenous nature can sometimes allow defects to be passed from one to the next.

To shed light on these less understood models, The Center for Research on Foundation Models (CRFM) at Stanford University has developed a new benchmarking approach for large language models called Holistic Evaluation of Language Models (HELM). CRFM scholars have benchmarked 30 language models across a core set of scenarios and metrics under standardized conditions in order to highlight their capabilities and risks.

Intended to serve as a map for the world of language models, HELM will be continually updated over time with new scenarios, metrics, and models through collaboration with the broader AI community, according to CRFM researchers.

The team has highlighted its holistic approach, emphasizing how assessing language models in their totality is necessary for building transparency and achieving the more comprehensive understanding needed to improve and mitigate the societal impact of this technology. The team lists the following three elements of this approach:

Broad coverage and recognition of incompleteness. Given language models’ vast surface of capabilities and risks, we need to evaluate language models over a broad range of scenarios. However, it is not possible to consider all the scenarios, so holistic evaluation should make explicit all the major scenarios and metrics that are missing.
Multi-metric measurement. Societally beneficial systems are characterized by many desiderata, but benchmarking in AI often centers on one (usually accuracy). Holistic evaluation should represent these plural desiderata.
Standardization. Our object of evaluation is the language model, not a scenario-specific system. Therefore, in order to meaningfully compare different LMs, the strategy for adapting an LM to a scenario should be controlled for. Furthermore, we should evaluate all the major LMs on the same scenarios to the extent possible.

Under the HELM benchmark, models are evaluated across a core set of scenarios and metrics under standardized conditions. Source: Stanford University

The researchers ran over 4900 evaluations of different models on different scenarios, amounting to over 12 billion tokens of model inputs and outputs that spans 17 million model calls. Notable findings include how there are consistent performance disparities present in all models, including racialized dialect disparities: “OPT (175B) is the most accurate model on TwitterAAE, but its accuracy degrades from 1.506 bits per byte for White English to 2.114 bits per byte for African American English (lower is better).” The team also found that biases and toxicity in model generations are largely constant across models and are low overall for core scenarios. Additionally, accuracy consistently improves as models become larger, but it comes with higher training and inference costs. The researchers detailed their findings in a scholarly paper available here.

The researchers say that for full transparency, all raw model prompts and completions are released publicly for further analysis. A general modular toolkit is also available for adding new scenarios, models, metrics, and prompting strategies.

To read a blog post with more technical details of the HELM benchmark, visit this link.

Stanford Researchers Detail New Method for Error Detection in Perception Data

BLOOM Large Language Model Breaks Down the Walled Garden

Applications: Artificial Intelligence

Sectors: Academia

Vendors: Stanford University

Tags: benchmark, language models, LLM, research, standards, Stanford University

Only registered users may comment. Register using the form below.

Check off newsletters you would like to receive*
- HPCwire
- EnterpriseTech
- Datanami
- Technology Conferences & Events
- Advanced Computing Job Bank
- Technology Product Showcase
Email*
Name*
First Last
Organization*
Job Function*
Industry*
Country*
City*
State*
Province*
- Please check here to receive valuable email offers from Datanami on behalf of our select partners.

Stanford Researchers Develop HELM Benchmark for Language Models

Join the discussion Cancel reply

Only registered users may comment. Register using the form below.

April 19, 2024

April 18, 2024

April 17, 2024

Sponsored Partner Content

Get your Data AI Ready – Celebrate One Year of Deep Dish Data Virtual Series!

Supercharge Your Data Lake with Spark 3.3

Learn How to Build a Custom Chatbot Using a RAG Workflow in Minutes [Hands-on Demo]

Overcome ETL Bottlenecks with Metadata-driven Integration for the AI Era [Free Guide]

Gartner® Hype Cycle™ for Analytics and Business Intelligence 2023

The Art of Mastering Data Quality for AI and Analytics

Leading Solution Providers

Tabor Network

Sponsored Whitepapers

Building an Operational Data Warehouse for Real-time Analytics

Can You Use Kafka as a Database?

Sponsored Multimedia

The Power of DataOps: Bring Automation to Life
No Comments

Tactical Steps for Cloud Migration
No Comments

Immuta Data Access Platform
No Comments

Data Mesh: Fact or Fiction?
No Comments

Contributors

Featured Events

Call & Contact Center Expo

AI & Big Data Expo North America 2024

AI Hardware & Edge AI Summit 2024

CDAO Government 2024

Stanford Researchers Develop HELM Benchmark for Language Models

Join the discussion Cancel reply

Only registered users may comment. Register using the form below.

April 19, 2024

April 18, 2024

April 17, 2024

Most Read Features

Most Read News In Brief

Most Read This Just In

Sponsored Partner Content

Leading Solution Providers

Tabor Network

Sponsored Whitepapers

Sponsored Multimedia

Contributors

Featured Events

Share

Copy short link