Friday, February 7, 2025

OpenAI Operator: A CUA-Powered Agent for Web Task Execution

Computer-Using Agent

A common interface via which AI can communicate with the digital environment.

A research preview of OpenAI Operator, an agent that can access the web and carry out tasks for you, was presented today. The Computer-Using Agent (CUA) model, which powers OpenAI Operator, combines sophisticated reasoning through reinforcement learning with the vision capabilities of GPT-4o. The buttons, menus, and text fields that people see on a screen are known as graphical user interfaces (GUIs), and CUA is trained to interact with them in the same way that humans do. This enables it to carry out digital tasks without the need for web-specific or OS-specific APIs.

CUA expands upon years of fundamental studies at the nexus of multimodal reasoning and understanding. It can divide activities into multi-step plans and adaptively self-correct when difficulties arise by fusing sophisticated GUI perception with structured problem-solving. By enabling models to use the same tools that people use on a daily basis, this capability represents the next stage in the evolution of AI and opens the door to a wide range of new applications.

Achieving a 38.1% success rate on OSWorld for tasks requiring full computer use, 58.1% on WebArena, and 87% on WebVoyager for tasks requiring web-based interaction, CUA sets new state-of-the-art benchmark scores, despite its early stages and restrictions. These findings demonstrate how CUA may function in a variety of settings by utilising a single universal action space.

As outlined in its Operator System Card, OpenAI prioritizes safety when developing CUA to solve the difficulties that arise when an agent has access to the digital world. As part of its incremental deployment approach, it is launching CUA for Pro⁠ Tier users in the United States through a research peek of OpenAI Operator at operator.chatgpt.com⁠

How it operates

How Computer-Using Agent works
How Computer-Using Agent works

CUA uses a virtual mouse and keyboard to perform activities and interprets raw pixel data to comprehend what is happening on the screen. It is capable of handling mistakes, navigating multi-step activities, and adjusting to sudden changes. Because of this, CUA may function in a variety of digital contexts, completing tasks like accessing websites and completing forms without the need for specialised APIs.

CUA functions by an iterative loop that combines perception, reasoning, and action in response to a user’s instruction:

Perception: A visual representation of the computer’s current status is provided by adding screenshots from the computer to the model’s environment.

Reasoning: CUA uses a chain-of-thought reasoning process to determine the future steps while accounting for previous and present screenshots and actions. By allowing the model to assess its observations, monitor intermediate stages, and make dynamic adjustments, this internal monologue enhances task performance.

Action: Until it determines that the activity is finished or that user interaction is required, it clicks, scrolls, or types. CUA asks for user approval for critical tasks like inputting login information or answering CAPTCHA forms, even though it completes the majority of procedures automatically.

Assessments

By utilising the same universal interface of screen, mouse, and keyboard, CUA sets new benchmarks for the state-of-the-art in computer and browser use.

Benchmark typeBenchmarkComputer use (universal interface)Web browsing agentsHuman
OpenAI CUAPrevious SOTAPrevious SOTA
Computer useOSWorld38.1%22.0%72.4%
Browser useWebArena58.1%36.2%57.1%78.2%
WebVoyager87.0%56.0%87.0%

Browser use

The purpose of WebArena⁠ and WebVoyager⁠ is to assess how well web browsing agents perform while utilising browsers to accomplish real-world tasks. WebArena mimics real-world situations in social forum platforms, online shop content management (CMS), e-commerce, and other areas by using self-hosted open-source websites offline. WebVoyager evaluates the model’s functionality on real-time online platforms such as Google Maps, GitHub, and Amazon.

Using the same universal interface that interprets the browser screen as pixels and responds with a mouse and keyboard, CUA establishes a new benchmark in these areas. For web-based tasks, CUA’s success rate was 87% on WebVoyager and 58.1% on WebArena. Even though CUA has a high success rate on WebVoyager, where the majority of the tasks are quite easy, it still need more development to match human performance on more difficult benchmarks like WebArena.

Computer use

A test called OSWorld⁠ assesses how well models can manage whole operating systems, including Windows, macOS, and Ubuntu. CUA achieves a success rate of 38.1% in this benchmark. It noticed test-time scaling, which indicates that when more steps are permitted, CUA performs better. The performance of CUA is contrasted with earlier state-of-the-arts with different maximum permitted steps in the picture below. Humans still have a long way to go, as evidenced by their 72.4% score on this benchmark.

OSWorld⁠ benchmark
OSWorld⁠ benchmark

CUA in the OpenAI Operator

Through a research preview of Operator, an agent that can access the web and carry out activities for you, it is enabling CUA. Pro users in the United States can access Operator at operator.chatgpt.com⁠. It has the chance to learn from its users and the larger ecosystem through this research preview, which will help us incrementally improve OpenAI Operator.

OpenAI don’t yet expect CUA to function consistently in every situation, as is the case with any early-stage technology. Nonetheless, it has already shown promise in a number of situations, and its goal is to increase that dependability for a larger set of jobs. By making CUA available in OpenAI Operator, it intends to collect insightful feedback from its users, which will help us improve its features and broaden its range of uses.

Security

CUA presents new dangers and challenges because it is one of its first agentic solutions that can perform activities directly in a browser. It conducted thorough safety testing and put mitigations in place for the three main categories of safety risks misuse, model errors, and frontier risks as it got ready to deploy OpenAI Operator. It put protections in place throughout the entire deployment environment, including the Operator system, the CUA model itself, and post-deployment procedures, because it thinks it’s critical to address safety in layers. The goal is to create stacking mitigations, where the risk profile is gradually decreased by each layer.

Misuse is the first type of risk. Building on its safety work for GPT-4o, it has created the following mitigations to lower OpenAI Operator’s risk of injury from misuse in addition to requiring users to adhere to its Usage Policies:

Refusals: The CUA model is taught to reject a variety of hazardous jobs as well as prohibited or regulated activities.

Blocklist: The OpenAI operator is unable to access websites that it has proactively blacklisted, including several gambling websites, adult entertainment websites, and stores that sell drugs or firearms.

Moderation: Automated safety checkers, which are intended to guarantee adherence to Usage Policies and have the power to stop or send warnings for forbidden activity, evaluate user interactions in real-time.

Offline detection: In order to implement our usage policies, it has also created automated detection and human review pipelines to find illegal usage in priority policy areas, such as misleading behaviours and kid protection.

Model errors fall under the second category of risk, when the CUA model unintentionally performs an activity that the user did not want, resulting in harm to the user or others. From a typo in an email to buying the incorrect item to permanently erasing a crucial document, hypothetical errors can vary in seriousness. In order to reduce possible harm, it has created the following mitigations:

User confirmations: In order to allow the user to verify the model’s work before it is finalised, the CUA model is trained to request user confirmation before completing actions that have external side effects, such as sending an email or placing an order.

Task restrictions: For the time being, the CUA model will not assist with some higher-risk jobs, such as financial transactions and tasks involving delicate decision-making.

Watch mode: OpenAI Operator necessitates active user supervision on very sensitive websites, including email, so users can immediately identify and correct any potential errors the model may make.

Adversarial attacks on websites, such as phishing attempts, jailbreaks, and prompt injections, are a significant category of model errors that result in the CUA model performing unexpected behaviours. To guard against these risks, it created a number of extra layers of defence in addition to the previously indicated mitigations against model errors:

Cautionary navigation: The CUA model recognises all but one instance from an early internal red-teaming session, and it is built to detect and disregard prompt injections on websites.

Monitoring: To monitor and halt execution if it notices questionable stuff on the screen, it has added another model to Operator.

Pipeline for detection: To find questionable access patterns that can be detected and quickly added to the monitor (within a few hours), it is using both automatic detection and human review procedures.

Lastly, it assessed the CUA model in comparison to the frontier risks described in its Preparedness Framework⁠, which included scenarios including biorisk tooling and autonomous replication. These evaluations revealed no further danger beyond GPT-4o.

OpenAI urges you to check the Operator System Card, a living document that offers transparency into its safety strategy and continuous improvements, if you would want to learn more about the assessments and safeguards.

The risks and mitigation strategies it has put in place are as novel as many of OpenAI Operator’s capabilities. It anticipates that these risks and its strategy will change as it gain more knowledge, even though it has strived for cutting-edge, varied, and complementary mitigations.

In conclusion

CUA expands on years of research into safety, logic, and multimodality. Deep reasoning (o-model series), vision (GPT-4o), and novel methods to increase resilience (instruction hierarchy and reinforcement learning) have all advanced significantly. Extending the agents’ action space is the next challenge space it intend to investigate. This problem is solved by the adaptability provided by a universal interface, which allows an agent to use any software program created for people. CUA can adapt to any computer system by going beyond specialised agent-friendly APIs, thereby tackling the “long tail” of digital use cases that are still outside the capabilities of the majority of AI models.

In order to enable developers to create their own computer-using agents, it also trying to make CUA accessible through the API⁠.

Drakshi
Drakshi
Since June 2023, Drakshi has been writing articles of Artificial Intelligence for govindhtech. She was a postgraduate in business administration. She was an enthusiast of Artificial Intelligence.
RELATED ARTICLES

Recent Posts

Popular Post

Govindhtech.com Would you like to receive notifications on latest updates? No Yes