Introducing Hugging Face HELMET: Holistically Evaluating Long-Context Language Models
A general overview of HELMET (Holistically Evaluating Long-Context Language Models), a new benchmark for thoroughly evaluating language models with extended context windows. It covers why HELMET was created, its key features and improvements over existing benchmarks, and the initial findings from evaluating a wide range of long-context language models (LCLMs).
Holistically Evaluating Long-Context Language Models Overview
The Rise and Significance of Long-Context Language Models (LCLMs)
With context windows far larger than those of typical models (2K–8K tokens), LCLMs have “immense potential to change the way we use and interact with language models.” Recent models such as GPT-4o, Claude-3, and Gemini 1.5 support extended context windows of up to millions of tokens.

Effectively evaluating these long-context language models (LCLMs) is difficult but essential to understanding their full potential and guiding future research.
Drawbacks of Current Benchmarks and Evaluation Techniques
- Contradictory findings: Existing benchmarks sometimes show smaller models outperforming larger ones (e.g., Llama-3.1 8B > 70B), evidence that current evaluation is not trustworthy.
- Inadequacy of traditional benchmarks: Natural-language benchmarks such as SCROLLS were designed for shorter inputs and are no longer adequate for the much longer contexts of LCLMs.
- Over-reliance on perplexity and synthetic tasks: Although perplexity and synthetic tasks such as “needle-in-a-haystack” (NIAH) have gained popularity, they often fail to reflect real-world performance. As the post notes, “recent works have shown that perplexity does not correlate well with downstream performance (Fang et al., 2024),” and its findings show that “synthetic tasks like NIAH do not correlate with real-world performance.”
- Lack of standardization: Model developers often evaluate on different, arbitrarily chosen datasets, which makes direct comparisons difficult.
- Specific drawbacks of existing realistic benchmarks (InfiniteBench, LongBench, and ZeroScrolls):
  - Insufficient coverage of downstream tasks: often concentrated on a few specific domains.
  - Insufficient lengths: many datasets only reach contexts shorter than 32K tokens, which is not enough to stress frontier LCLMs.
  - Unreliable metrics: n-gram matching metrics such as ROUGE “do not correlate with human judgements (Goyal et al., 2023) and do not distinguish between models.”
  - Incompatibility with base models: many tasks can only be run with instruction-tuned models.
Introducing HELMET: A Comprehensive LCLM Benchmark
To address the shortcomings of current approaches, the authors propose HELMET (How to Evaluate Long-context Language Models Effectively and Thoroughly), a new benchmark.
The following are the main requirements for its design:
- Diverse coverage of downstream tasks: including retrieval-augmented generation, generation with citations, and summarization.
- Length and complexity control: Enables evaluation at input lengths from 8K to 128K tokens, and is readily extensible to longer contexts.
- Reliable evaluation of both base and instruction-tuned models: using model-based evaluations validated by human studies, which correlate with human judgement better than conventional metrics.
HELMET’s Main Advantages Over Current Benchmarks
- Broad Coverage: HELMET includes a wider variety of tasks with naturally long contexts that reflect real-world applications, and it relies on “reliable evaluation settings, such as model-based evaluations and human studies.”
- Controllable Length and Difficulty: Input length can be adjusted across tasks through multiple mechanisms (e.g., document length, number of retrieved passages, number of demonstrations). The authors deliberately selected datasets with long natural documents to stress-test frontier models; see the config sketch after this list.
- Reliable Evaluation: HELMET places a high priority on “model-based evaluations that show better distinguishability between models and different input lengths” and verifies their reliability through “human studies.” Noisy n-gram metrics such as ROUGE are dropped entirely. Figure 3 in the post shows that model-based evaluation distinguishes between models where ROUGE cannot.

- Robust Prompting: HELMET supports evaluating both instruction-tuned and base models by providing in-context learning examples for a subset of tasks, making it more representative of real-world applications where base models are essential.
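To make the length and difficulty control concrete, here is a minimal sketch of the knobs involved, assuming a RAG-style task; the field names follow the kilt_nq config shown later in this post, and the values are purely illustrative:
input_max_length: 65536   # total context budget; raise to 131072 to probe longer contexts
datasets: kilt_nq         # RAG task: the context is filled with more retrieved passages as the length grows
shots: 2                  # number of in-context demonstrations (also affects length and difficulty)
generation_max_length: 20
max_test_samples: 100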
Key Findings from HELMET Evaluations (across 59 LCLMs)
- Diverse evaluation is essential: Performance correlates only weakly across task categories (such as RAG, summarization, and citation), so evaluating on a single task is not enough to understand a model’s overall capabilities. ICL correlates least with the other tasks, suggesting it exercises a distinct set of model skills.
- Models degrade as task complexity and length increase: Even advanced models like GPT-4o and Gemini perform worse on challenging tasks such as re-ranking as input length grows. This degradation does not show up on synthetic tasks.
- Open-source models lag closed-source models on complex tasks: The gap can be small on easier tasks, but it widens considerably on harder ones, such as citation generation.
- No one model is the “best” in every category: This emphasizes the necessity of thorough assessment along a variety of dimensions.
Using HELMET in Future Development
- Simple to use: The benchmark is available through a GitHub repository with easy-to-follow setup instructions.
- Flexible model loading: Supports several ways to load and serve models, including:
- HuggingFace’s transformers library
python eval.py --config configs/rag.yaml --model_name_or_path <model_name>
- HuggingFace’s Text Generation Inference (TGI). Prepare a task config like the one below; the last two lines are the TGI-specific additions:
input_max_length: 131072
datasets: kilt_nq
generation_max_length: 20
test_files: data/kilt/nq-dev-multikilt_1000_k1000_dep6.jsonl
demo_files: data/kilt/nq-train-multikilt_1000_k3_dep6.jsonl
use_chat_template: true
max_test_samples: 100
shots: 2
stop_new_line: true
model_name_or_path: tgi:meta-llama/Llama-3.1-8B-Instruct # need to add "tgi:" prefix
use_tgi_serving: true # add this line in your config
Then use the command below to run the benchmark:
export LLM_ENDPOINT=<your-tgi-endpoint> # example: "https://10.10.10.1:8080/v1"
python eval.py --config configs/config.yaml --endpoint_url $LLM_ENDPOINT
- HuggingFace Inference Endpoints (which support Intel Gaudi accelerators)
export LLM_ENDPOINT=<your-hf-inference-endpoint> # example: "https://XXXX.us-east-1.aws.endpoints.huggingface.cloud/v1"
export API_KEY=<your-hf-api-key>
python eval.py --config configs/config.yaml --endpoint_url $LLM_ENDPOINT --api_key $API_KEY
- Using vLLM. Set the following in your config:
model_name_or_path: meta-llama/Llama-3.1-8B-Instruct # no prefix needed
use_vllm_serving: true # use vllm instead of tgi
Then use the command below to run the benchmark:
export LLM_ENDPOINT=<your-vllm-endpoint>
python eval.py --config configs/config.yaml --endpoint_url $LLM_ENDPOINT
- Model-provider APIs (OpenAI, TogetherAI, Anthropic, and Google)
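As a rough sketch of what an API-based run could look like, assuming the provider key is picked up from its standard environment variable and that API models are selected by name the same way as local ones (check the HELMET repository for the exact convention):
export OPENAI_API_KEY=<your-openai-api-key>   # assumption: key is read from OpenAI's standard env var
python eval.py --config configs/rag.yaml --model_name_or_path gpt-4o-2024-05-13   # illustrative model name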
- Faster development cycles: The Recall and RAG tasks are recommended for rapid iteration during model development, since they are fast to run and correlate well with the other realistic tasks (a quick-iteration sketch is shown below).
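For instance, a quick iteration loop might run only those two task configs; configs/rag.yaml appears earlier in this post, while configs/recall.yaml is an assumed file name to verify against the repository:
# Hypothetical quick-iteration loop over the fast, well-correlated tasks
for cfg in configs/recall.yaml configs/rag.yaml; do
  python eval.py --config $cfg --model_name_or_path <model_name>
done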
- Direct comparison with existing models: Researchers can compare their models against HELMET’s published results on 59 models without having to rerun those evaluations themselves; a leaderboard is available on the project website.
Looking Ahead: LongProc Integration
The authors recently released LongProc, a benchmark designed specifically for evaluating long-context language models (LCLMs) on long-form generation and procedure following.
They are working to integrate LongProc into HELMET to provide an even more complete evaluation suite, especially for tasks that require very long outputs (up to 8K tokens).
In summary
HELMET is an important step forward in the evaluation of long-context language models. By addressing the drawbacks of previous benchmarks with diverse tasks, controllable lengths, reliable evaluation metrics, and support for both base and instruction-tuned models, it offers a more comprehensive and accurate picture of LCLM capabilities.
The initial results show that even state-of-the-art models struggle as task complexity and context length increase, underscoring the importance of diverse evaluation. HELMET gives researchers and practitioners a practical tool for evaluating and differentiating long-context language models (LCLMs), ultimately helping to advance this rapidly developing field.