A simple yet challenging benchmark that measures AI agents’ ability to locate hard-to-find information on the web.
Artificial intelligence (AI) agents that can browse the internet are becoming increasingly important. A capable browsing agent should be able to find hard-to-locate information, which may require trawling tens or even hundreds of websites. Models with access to browsing tools, such as GPT‑4o with browsing, have already saturated existing benchmarks like SimpleQA, which measure a model’s ability to retrieve basic, isolated facts. To measure how well AI agents can locate hard-to-find, entangled information on the internet, OpenAI is open-sourcing BrowseComp (“Browsing Competition”), a new benchmark of 1,266 challenging problems. You can read the research paper, and the benchmark is available in OpenAI’s simple-evals GitHub repository.
About the BrowseComp benchmark
OpenAI developed BrowseComp as a browsing benchmark that is hard for models yet easy to verify. One of the main obstacles to evaluating large language models is that, by default, they produce long, open-ended responses. BrowseComp therefore concentrates on questions with short answers that have, in principle, exactly one correct response. Because of this emphasis on short answers, it is unclear how strongly BrowseComp performance correlates with performance on a real, open-ended user distribution. OpenAI accepts this trade-off because short answers are easy to grade, which keeps the benchmark simple to use.
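To illustrate why the short-answer format keeps grading simple, here is a minimal sketch of a normalized exact-match check. This is an illustrative assumption rather than the benchmark’s actual grader, which may work differently (for example, by prompting a model to compare answers).

```python
import re

def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return re.sub(r"\s+", " ", text).strip()

def is_correct(predicted: str, reference: str) -> bool:
    """Grade a short answer by comparing its normalized form to the reference."""
    return normalize(predicted) == normalize(reference)

# Example: casing and trailing punctuation do not affect the grade.
print(is_correct(
    "Frequency Effects on Syntactic Rule Learning in Transformers.",
    "frequency effects on syntactic rule learning in transformers",
))  # True
```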
Following the guidelines of OpenAI’s earlier factuality benchmark, SimpleQA, OpenAI asked human trainers to construct difficult, fact-seeking questions with a single, indisputable, short answer that would not change over time and was backed by supporting evidence. What makes BrowseComp distinctive is that its trainers crafted extremely difficult questions. To make sure the questions were hard enough, OpenAI applied three checks (a rough sketch of this vetting flow follows the list):
- Existing models could not answer the question. Trainers were asked to verify that GPT‑4o (both with and without browsing), OpenAI o1, and an early deep research model could not solve the problem.
- Trainers were asked to run five simple searches and confirm that the answer did not appear in the first few pages of the search engine’s results.
- Trainers were asked to create tasks that another person could not solve within ten minutes. Although this was not strictly enforced, a second trainer attempted some of the questions, and trainers whose tasks were solved more than 40% of the time were asked to revise them.
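As a rough illustration, the sketch below strings the three checks together into a single vetting function. The inputs here (baseline model answers, search-result pages, a second trainer’s solve time) are hypothetical stand-ins, not OpenAI’s actual tooling.

```python
def passes_difficulty_checks(
    reference_answer: str,
    baseline_answers: list[str],          # answers from GPT-4o, o1, etc. (hypothetical inputs)
    search_result_pages: list[str],       # text of the first result pages for five simple searches
    second_trainer_minutes: float | None, # None if the second trainer gave up
) -> bool:
    """Return True only if a candidate question clears all three checks described above."""
    # Check 1: no available model already answers the question correctly.
    if any(ans.strip().lower() == reference_answer.strip().lower() for ans in baseline_answers):
        return False
    # Check 2: the answer does not appear in the first pages of simple search results.
    if any(reference_answer.lower() in page.lower() for page in search_result_pages):
        return False
    # Check 3: another person could not solve it within ten minutes.
    if second_trainer_minutes is not None and second_trainer_minutes <= 10:
        return False
    return True
```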
To create difficult questions, OpenAI advised trainers to start with a fact and then craft an “inverted” question whose answer is hard to locate but simple to confirm. Trainers would begin with a “seed” (a person, event, or artefact), gather many of its traits across a wide search space, and turn them into a question. OpenAI provides the following sample question:
Please identify the title of the scientific paper published at the EMNLP conference between 2018 and 2023 whose first author completed their undergraduate studies at Dartmouth College and whose fourth author completed their undergraduate studies at the University of Pennsylvania. (Answer: “Frequency Effects on Syntactic Rule Learning in Transformers”, EMNLP 2021)
This question can be verified easily with a few web searches, but finding the answer is hard: a brute-force approach would mean combing through thousands of publications and investigating the background of every author. Questions that are hard to solve but easy to verify (an “asymmetry of verification”) make good benchmarks because they are both challenging and reliable to grade.
Although BrowseComp is simple in format, it measures core capabilities of an effective browsing agent:
- Models need to be able to reason about the veracity of online content in order to provide the right response.
- Because the answers are hard to find, doing well on BrowseComp requires persistence and strong browsing skill.
- For many questions, a brute-force approach would be too slow or outright infeasible, so the model must search creatively to arrive at the correct answer in a reasonable amount of time.
BrowseComp is a useful but incomplete benchmark for browsing agents. It sidesteps the difficulties of a real user query distribution, such as producing long-form answers or resolving ambiguity, but it measures the crucial core skill of tracking down information with persistence and creativity. As a rough analogy, models that win programming contests such as CodeForces exhibit strong coding skills that probably transfer to other coding tasks, though this is not guaranteed. Similarly, a model that solves BrowseComp must be highly skilled at locating hard-to-find information, though this does not necessarily carry over to every browsing-related task.
Diversity and difficulty of the dataset
When building BrowseComp, OpenAI invited trainers to write questions about topics they were personally interested in, hoping that questions rooted in personal interests would make the work more engaging and yield higher-quality data.
To gauge how difficult the dataset is, OpenAI asked human trainers to attempt to answer BrowseComp questions. These trainers came from the same pool that wrote the questions, but they did not attempt questions they had written themselves. They were asked to complete the task without an AI assistant (specifically, without ChatGPT, Claude, Perplexity, Grok, or Gemini) and were not shown the reference answer.
Because some questions are very difficult, trainers were allowed to mark a problem as unsolvable and move on if they could not find the answer within two hours of searching. As shown below, trainers solved 29.2% of the problems, and when they did, their answer matched the original reference answer 86.4% of the time.
| Verification campaign | Count |
| --- | --- |
| Total problems | 1,255 |
| Unsolvable | 888 / 1,255 (70.8%) |
| Solvable | 367 / 1,255 (29.2%) |
| Of solvable problems, trainer answer and reference answer agree | 317 / 367 (86.4%) |
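As a quick sanity check, the percentages in the table follow directly from the raw counts:

```python
# Reproduce the verification-campaign percentages from the raw counts above.
total = 1255
unsolvable = 888
solvable = total - unsolvable   # 367
agree = 317                     # solvable problems where the trainer's answer matched the reference

print(f"Unsolvable: {unsolvable / total:.1%}")  # 70.8%
print(f"Solvable:   {solvable / total:.1%}")    # 29.2%
print(f"Agreement:  {agree / solvable:.1%}")    # 86.4%
```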
OpenAI models’ performance
OpenAI evaluated a range of models on BrowseComp: GPT‑4o, GPT‑4.5, and OpenAI o1 (medium) without browsing, as well as GPT‑4o with browsing and Deep Research, an agent model trained specifically for persistent web browsing. GPT‑4o and GPT‑4.5 achieved near-zero accuracy, as shown in the table below, underscoring the benchmark’s difficulty: without strong reasoning or tool use, models cannot uncover the kind of obscure, multi-hop facts that BrowseComp targets.
| Model | Accuracy (%) |
| --- | --- |
| GPT‑4o | 0.6 |
| GPT‑4o w/ browsing | 1.9 |
| GPT‑4.5 | 0.9 |
| OpenAI o1 | 9.9 |
| Deep research | 51.5 |
Enabling browsing for GPT‑4o raised accuracy only slightly (from 0.6% to 1.9%), and performance stayed low. This suggests that browsing alone is not enough: models must also reason strategically, identify promising search paths, and interpret the content they retrieve. In contrast, OpenAI o1, which has no browsing capability but stronger reasoning, achieves considerably higher accuracy, suggesting that some BrowseComp answers can be surfaced through inference over internal knowledge. Overall, these results show that both reasoning and tool use strongly affect BrowseComp performance.
Deep Research solves about half of the problems and performs markedly better than any other model. Its ability to autonomously search the web, analyse and synthesise information from many sources, and adapt its search strategy lets it answer questions that would otherwise go unanswered. By synthesising large amounts of online information, changing course in response to what it finds, and citing each claim, it excels at the specialised, non-intuitive questions that require browsing a large number of websites, exactly the kind of challenge BrowseComp is designed to measure.
Test-time compute scaling
A crucial property of agents, demonstrated by OpenAI o1 on AIME and by OpenAI o3‑mini at low/medium/high settings, is that performance scales with the amount of computation used at inference time. Because BrowseComp questions require iteratively visiting many websites and aggregating information, one should likewise expect more inference-time compute to improve BrowseComp performance.
Aggregation techniques that use additional compute
Beyond improved performance as a function of compute used in a single attempt, OpenAI evaluated whether the Deep Research model could do even better by attempting each problem multiple times and selecting the best answer. In this experiment, three methods were tested for aggregating 64 sampled outputs per question: majority voting, weighted voting, and best-of-N (a rough sketch of each follows the list).
- Majority voting chooses the most common sampled answer.
- Weighted voting prompts the model zero-shot to produce a confidence score for each attempt and votes using those scores.
- Best-of-N chooses the single most confident output.
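The sketch below shows one straightforward way to implement the three strategies over N sampled (answer, confidence) pairs per question. The confidence values are assumed to come from zero-shot prompting the model to rate each attempt, and summing them per answer for weighted voting is an illustrative choice rather than the paper’s exact formula.

```python
from collections import Counter, defaultdict

Sample = tuple[str, float]  # (answer, confidence in [0, 1])

def majority_vote(samples: list[Sample]) -> str:
    """Pick the most frequent answer, ignoring confidence."""
    return Counter(answer for answer, _ in samples).most_common(1)[0][0]

def weighted_vote(samples: list[Sample]) -> str:
    """Sum confidence per distinct answer and pick the answer with the highest total."""
    totals: dict[str, float] = defaultdict(float)
    for answer, confidence in samples:
        totals[answer] += confidence
    return max(totals, key=totals.get)

def best_of_n(samples: list[Sample]) -> str:
    """Pick the answer from the single most confident attempt."""
    return max(samples, key=lambda s: s[1])[0]

# Toy example with three sampled attempts for one question.
samples = [("Answer A", 0.6), ("Answer B", 0.9), ("Answer A", 0.5)]
print(majority_vote(samples))  # Answer A (appears twice)
print(weighted_vote(samples))  # Answer A (0.6 + 0.5 = 1.1 > 0.9)
print(best_of_n(samples))      # Answer B (single most confident attempt)
```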
Conclusions
BrowseComp measures a model’s ability to search the internet for hard-to-find information. Although it does not attempt to measure performance on common user queries, it tests the ability to locate a single targeted piece of information, is easy to grade, and remains challenging for current browsing agents. By open-sourcing BrowseComp, OpenAI hopes to encourage research into more reliable and trustworthy AI agents.