Introducing Hugging Face HELMET: Holistically Evaluating Long-Context Language Models
A general overview of HELMET (Holistically Evaluating Long-Context Language Models), a new benchmark for thoroughly evaluating language models with extended context windows. It covers why HELMET was created, its key features and improvements over existing benchmarks, and the initial findings from evaluating a wide range of long-context language models (LCLMs).
Holistically Evaluating Long-Context Language Models Overview
The Rise and Significance of Long-Context Language Models (LCLMs)
With context windows far larger than those of typical models (2K–8K tokens), LCLMs have “immense potential to change the way we use and interact with language models.” Recent models such as GPT-4o, Claude-3, and Gemini 1.5 support extended context windows of up to millions of tokens.

Effectively evaluating these long-context language models (LCLMs) is difficult but essential to understanding their full potential and guiding future research.
Drawbacks of Current Benchmarks and Evaluation Techniques
- Contradictory findings: Existing benchmarks sometimes show smaller models outperforming larger ones (e.g., Llama-3.1 8B > 70B), evidence that current evaluation is not trustworthy.
- Inadequacy of traditional benchmarks: Natural-language benchmarks such as SCROLLS were designed for shorter inputs and are no longer adequate for the much longer contexts of LCLMs.
- Over-reliance on perplexity and synthetic tasks: Although perplexity and synthetic tasks such as “needle-in-a-haystack” (NIAH) have gained popularity, they often fail to reflect real-world performance. As the post notes, “recent works have shown that perplexity does not correlate well with downstream performance (Fang et al., 2024),” and its findings show that “synthetic tasks like NIAH do not correlate with real-world performance.”
- Lack of standardization: Model developers often evaluate on different, arbitrarily chosen datasets, which makes direct comparisons difficult.
- Specific drawbacks of existing realistic benchmarks (InfiniteBench, LongBench, and ZeroScrolls):
  - Insufficient coverage of downstream tasks: often concentrated on a few specific domains.
  - Insufficient lengths: many datasets only reach contexts shorter than 32K tokens, which is not enough to stress frontier LCLMs.
  - Unreliable metrics: n-gram matching metrics such as ROUGE “do not correlate with human judgements (Goyal et al., 2023) and do not distinguish between models.”
  - Incompatibility with base models: many tasks can only be run with instruction-tuned models.
Introducing HELMET: A Comprehensive LCLM Benchmark
To address the shortcomings of current approaches, the authors propose HELMET (How to Evaluate Long-context Language Models Effectively and Thoroughly), a new benchmark.
The following are the main requirements for its design:
- Diverse coverage of downstream tasks: including retrieval-augmented generation, generation with citations, and summarization.
- Length and complexity control: Enables evaluation at input lengths from 8K to 128K tokens, and is readily extensible to longer contexts.
- Reliable evaluation of both base and instruction-tuned models: using model-based evaluations validated by human studies, which correlate with human judgement better than conventional metrics.
HELMET’s Main Advantages Over Current Benchmarks
- Broad Coverage: HELMET includes a wider variety of tasks with naturally long contexts that reflect real-world applications, and it relies on “reliable evaluation settings, such as model-based evaluations and human studies.”
- Controllable Length and Difficulty: Input length can be adjusted across tasks through multiple mechanisms (e.g., document length, number of retrieved passages, number of demonstrations). The authors deliberately selected datasets with long natural documents to stress-test frontier models; see the config sketch after this list.
- Reliable Evaluation: HELMET places a high priority on “model-based evaluations that show better distinguishability between models and different input lengths” and verifies their reliability through “human studies.” Noisy n-gram metrics such as ROUGE are dropped entirely. Figure 3 in the post shows that model-based evaluation distinguishes between models where ROUGE cannot.

- Robust Prompting: HELMET supports evaluating both instruction-tuned and base models by providing in-context learning examples for a subset of tasks, making it more representative of real-world applications where base models are essential.
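To make the length and difficulty control concrete, here is a minimal sketch of the knobs involved, assuming a RAG-style task; the field names follow the kilt_nq config shown later in this post, and the values are purely illustrative:
input_max_length: 65536   # total context budget; raise to 131072 to probe longer contexts
datasets: kilt_nq         # RAG task: the context is filled with more retrieved passages as the length grows
shots: 2                  # number of in-context demonstrations (also affects length and difficulty)
generation_max_length: 20
max_test_samples: 100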
Key Findings from HELMET Evaluations (across 59 LCLMs)
- Diverse evaluation is essential: Performance correlates only weakly across task categories (such as RAG, summarization, and citation), so evaluating on a single task is not enough to understand a model’s overall capabilities. ICL correlates least with the other tasks, suggesting it exercises a distinct set of model skills.
- Models degrade as task complexity and length increase: Even advanced models like GPT-4o and Gemini perform worse on challenging tasks such as re-ranking as input length grows. This degradation does not show up on synthetic tasks.
- Open-source models lag closed-source models on complex tasks: The gap can be small on easier tasks, but it widens considerably on harder ones, such as citation generation.
- No one model is the “best” in every category: This emphasizes the necessity of thorough assessment along a variety of dimensions.
Using HELMET in Future Development
- Simple to use: The benchmark is available through a GitHub repository with easy-to-follow setup instructions.
- Flexible model loading: Supports several ways to load and serve models, including:
- HuggingFace’s transformers library
python eval.py --config configs/rag.yaml --model_name_or_path <model_name>
- HuggingFace’s Text Generation Inference (TGI). Prepare a task config like the one below; the last two lines are the TGI-specific additions:
input_max_length: 131072
datasets: kilt_nq
generation_max_length: 20
test_files: data/kilt/nq-dev-multikilt_1000_k1000_dep6.jsonl
demo_files: data/kilt/nq-train-multikilt_1000_k3_dep6.jsonl
use_chat_template: true
max_test_samples: 100
shots: 2
stop_new_line: true
model_name_or_path: tgi:meta-llama/Llama-3.1-8B-Instruct # need to add "tgi:" prefix
use_tgi_serving: true # add this line in your config
Then use the command below to run the benchmark:
export LLM_ENDPOINT=<your-tgi-endpoint> # example: "https://10.10.10.1:8080/v1"
python eval.py --config configs/config.yaml --endpoint_url $LLM_ENDPOINT
- HuggingFace Inference Endpoints (which support Intel Gaudi accelerators)
export LLM_ENDPOINT=<your-hf-inference-endpoint> # example: "https://XXXX.us-east-1.aws.endpoints.huggingface.cloud/v1"
export API_KEY=<your-hf-api-key>
python eval.py --config configs/config.yaml --endpoint_url $LLM_ENDPOINT --api_key $API_KEY
- Using vLLM. Set the following in your config:
model_name_or_path: meta-llama/Llama-3.1-8B-Instruct # no prefix needed
use_vllm_serving: true # use vllm instead of tgi
Then use the command below to run the benchmark:
export LLM_ENDPOINT=<your-vllm-endpoint>
python eval.py --config configs/config.yaml --endpoint_url $LLM_ENDPOINT
- Model-provider APIs (OpenAI, TogetherAI, Anthropic, and Google)
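As a rough sketch of what an API-based run could look like, assuming the provider key is picked up from its standard environment variable and that API models are selected by name the same way as local ones (check the HELMET repository for the exact convention):
export OPENAI_API_KEY=<your-openai-api-key>   # assumption: key is read from OpenAI's standard env var
python eval.py --config configs/rag.yaml --model_name_or_path gpt-4o-2024-05-13   # illustrative model name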
- Faster development cycles: The Recall and RAG tasks are recommended for rapid iteration during model development, since they are fast to run and correlate well with the other realistic tasks (a quick-iteration sketch is shown below).
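For instance, a quick iteration loop might run only those two task configs; configs/rag.yaml appears earlier in this post, while configs/recall.yaml is an assumed file name to verify against the repository:
# Hypothetical quick-iteration loop over the fast, well-correlated tasks
for cfg in configs/recall.yaml configs/rag.yaml; do
  python eval.py --config $cfg --model_name_or_path <model_name>
done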
- Direct comparison with existing models: Researchers can compare their models against HELMET’s published results on 59 models without having to rerun those evaluations themselves; a leaderboard is available on the project website.
Looking Ahead: LongProc Integration
The authors recently released LongProc, a benchmark designed specifically for evaluating long-context language models (LCLMs) on long-form generation and procedure following.
They are working to integrate LongProc into HELMET to provide an even more complete evaluation suite, especially for tasks that require very long outputs (up to 8K tokens).
In summary
HELMET is an important step forward in the evaluation of long-context language models. By addressing the drawbacks of previous benchmarks with diverse tasks, controllable lengths, reliable evaluation metrics, and support for both base and instruction-tuned models, it offers a more comprehensive and accurate picture of LCLM capabilities.
The initial results show that even state-of-the-art models struggle as task complexity and context length increase, underscoring the importance of diverse evaluation. HELMET gives researchers and practitioners a practical tool for evaluating and differentiating long-context language models (LCLMs), ultimately helping to advance this rapidly developing field.