Introducing GPT-4.1 in the API
A new family of GPT models featuring the first-ever nano model, along with major advances in coding, instruction following, and long context.
Three new models are launching in the API today: GPT-4.1, GPT-4.1 mini, and GPT-4.1 nano. These models outperform GPT-4o and GPT-4o mini across the board, with major gains in coding and instruction following. They also have larger context windows, supporting up to 1 million tokens of context, and improved long-context comprehension lets them use that context more effectively.
GPT-4.1 delivers exceptional results on the following industry-standard benchmarks:
- Coding: GPT-4.1 scores 54.6% on SWE-bench Verified, improving by 21.4 percentage points over GPT-4o and 26.6 points over GPT-4.5, making it a leading model for coding.
- Instruction following: GPT-4.1 scores 38.3% on Scale’s MultiChallenge benchmark, a measure of instruction-following ability, a 10.5 point increase over GPT-4o.
- Long context: GPT-4.1 sets a new state-of-the-art score of 72.0% on the long, no-subtitles category of Video-MME, a benchmark for multimodal long-context understanding, a 6.7 point improvement over GPT-4o.
While benchmarks offer valuable insight, these models were trained with a focus on real-world utility. Close collaboration and partnership with the developer community made it possible to optimize the models for the tasks that matter most to their applications.
To that end, the GPT-4.1 model family offers exceptional performance at lower cost. These models push performance forward at every point on the latency curve.

GPT-4.1 mini marks a significant leap in small-model performance, even surpassing GPT-4o on many benchmarks. It matches or exceeds GPT-4o on intelligence evals while reducing latency by nearly half and cost by 83%.
GPT-4.1 nano is the fastest and cheapest model available, ideal for workloads that demand low latency. With its 1 million token context window, it delivers remarkable performance in a small footprint, scoring 80.1% on MMLU, 50.3% on GPQA, and 9.8% on Aider polyglot coding, higher than GPT-4o mini. It is a great fit for tasks like classification or autocompletion.
Thanks to these gains in instruction-following reliability and long-context comprehension, the GPT-4.1 models are also considerably more effective at powering agents: systems that can independently accomplish tasks on behalf of users. Combined with primitives like the Responses API, developers can now build agents that are more useful and reliable at real-world software engineering, extracting insights from large documents, resolving customer requests with minimal hand-holding, and other complex tasks.
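As a minimal sketch of what calling one of these models through the Responses API can look like (assuming the OpenAI Python SDK and an `OPENAI_API_KEY` in the environment; the prompt here is illustrative):

```python
# Minimal sketch: calling GPT-4.1 through the Responses API using the
# OpenAI Python SDK. Assumes OPENAI_API_KEY is set in the environment.
from openai import OpenAI

client = OpenAI()

response = client.responses.create(
    model="gpt-4.1",
    instructions="You are a support agent. Resolve the ticket step by step.",
    input="A customer reports that CSV exports fail for files over 10 MB.",
)

# output_text is the SDK's convenience property for the concatenated text output.
print(response.output_text)
```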
Note that GPT-4.1 is available only through the API. In ChatGPT, many of the improvements in instruction following, coding, and intelligence have been gradually incorporated into the latest version of GPT-4o, and more will continue to be added in future releases.
GPT-4.5 Preview will also begin to be deprecated in the API, since GPT-4.1 offers similar or improved performance on many key capabilities at much lower cost and latency. To give developers time to transition, GPT-4.5 Preview will be turned off on July 14, 2025, three months from now. GPT-4.5 was introduced as a research preview to explore and experiment with a large, compute-intensive model, and developer feedback on it has taught the team a great deal.
The creativity, writing quality, humor, and nuance that developers appreciated in GPT-4.5 will continue to be carried forward into future API models. Throughout this post, examples from alpha testers such as Thomson Reuters and Carlyle show how GPT-4.1 performs in production on domain-specific tasks.
Instruction following
GPT-4.1 follows instructions more reliably, with notable improvements measured across a range of instruction-following evaluations.
An internal instruction-following eval was developed to track model performance across multiple dimensions and several key categories of instruction following, including:
- Format following. Providing instructions that specify a custom format for the model’s output, such as XML, YAML, or Markdown.
- Negative instructions. Specifying behavior the model should avoid. (For example, “Don’t ask the user to contact support.”)
- Ordered instructions. Providing a set of instructions the model must follow in a given order. (For example, “First ask for the user’s name, then ask for their email.”)
- Content requirements. Outputting content that must include certain information. (For example, “Always include the amount of protein when writing a nutrition plan.”)
- Ranking. Ordering the output in a particular way. (For example, “Sort the response by population count.”)
- Overconfidence. Instructing the model to say “I don’t know” or similar if the requested information is not available, or if the request doesn’t fall into a given category. (For example, “If you do not know the answer, provide the support contact email.”)
These categories were chosen based on feedback from developers about which facets of instruction following are most relevant and important to them.
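To make these categories concrete, here is a sketch of a single prompt that combines several of them at once; the prompt text is hypothetical and the example again assumes the OpenAI Python SDK:

```python
from openai import OpenAI

client = OpenAI()

# Hypothetical system prompt combining several of the instruction-following
# categories above: format following, ordered instructions, a negative
# instruction, and an overconfidence clause.
system_prompt = (
    "Respond in YAML with the keys `name`, `email`, and `summary`. "  # format
    "First ask for the user's name, then their email. "               # ordered
    "Do not ask the user to contact support. "                        # negative
    "If you do not know an answer, say 'I don't know'."               # overconfidence
)

response = client.responses.create(
    model="gpt-4.1",
    instructions=system_prompt,
    input="Hi, I need help updating my billing address.",
)
print(response.output_text)
```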
Within each category, prompts are split into easy, medium, and hard difficulty; GPT-4.1 in particular performs markedly better than GPT-4o on hard prompts.
Better instruction following makes existing applications more reliable and unlocks new ones that were previously limited by poor reliability. Early testers noted that GPT-4.1 can be more literal, so it helps to be explicit and specific in prompts. For more detail on GPT-4.1 prompting best practices, please refer to the prompting guide.
Long context
GPT-4.1, GPT-4.1 mini, and GPT-4.1 nano can process up to 1 million tokens of context, up from 128,000 for previous GPT-4o models. One million tokens is more than eight copies of the entire React codebase, so long context is a great fit for processing large codebases or many long documents.
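As a rough, local sanity check that an input will fit in the window, token counts can be estimated with the `tiktoken` library. Using the `o200k_base` encoding as a proxy for GPT-4.1’s tokenizer is an assumption here, not something stated in this post:

```python
# Sketch: estimate whether a set of documents fits within a 1M-token window.
# Requires `pip install tiktoken`; o200k_base as a stand-in for GPT-4.1's
# tokenizer is an assumption.
import tiktoken

enc = tiktoken.get_encoding("o200k_base")

def total_tokens(documents: list[str]) -> int:
    """Sum the token counts of all documents."""
    return sum(len(enc.encode(doc)) for doc in documents)

docs = [open(path).read() for path in ["notes.txt", "report.txt"]]  # illustrative paths
count = total_tokens(docs)
print(f"{count} tokens; fits in 1M window: {count <= 1_000_000}")
```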
GPT-4.1 was trained to reliably attend to information across the full 1 million token context length. It is also significantly more accurate than GPT-4o at noticing relevant text and ignoring distractors across both long and short context lengths. Long-context understanding is a critical capability for applications in coding, customer service, law, and many other fields.
The demonstration below shows GPT-4.1 retrieving a small hidden piece of information (a “needle”) placed at various points within the context window. GPT-4.1 consistently and accurately retrieves the needle at all positions and all context lengths, up to 1 million tokens. It can effectively pull out relevant details for the task at hand no matter where they sit in the input.
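A simple version of this needle-in-a-haystack probe can be reproduced against the API. The filler text, needle, and depths below are illustrative, and the sketch assumes the OpenAI Python SDK:

```python
from openai import OpenAI

client = OpenAI()

# Illustrative needle-in-a-haystack probe: bury one fact at a chosen relative
# depth inside filler text, then ask the model to recall it.
NEEDLE = "The secret launch code is 7-alpha-9."
FILLER = "The quick brown fox jumps over the lazy dog. " * 20_000  # roughly 180K tokens

def build_haystack(depth: float) -> str:
    """Insert the needle at fraction `depth` (0.0 = start, 1.0 = end) of the filler."""
    cut = int(len(FILLER) * depth)
    return FILLER[:cut] + NEEDLE + " " + FILLER[cut:]

for depth in (0.0, 0.5, 1.0):
    response = client.responses.create(
        model="gpt-4.1",
        input=build_haystack(depth) + "\n\nWhat is the secret launch code?",
    )
    print(f"depth {depth}: {response.output_text}")
```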
However, few real-world tasks are as straightforward as retrieving a single, obvious needle answer. Users frequently need models to retrieve and understand multiple pieces of information, and to understand those pieces in relation to one another. To showcase this capability, a new evaluation is being open-sourced: OpenAI-MRCR (Multi-Round Coreference).
OpenAI-MRCR tests the model’s ability to find and disambiguate multiple needles that are well hidden in context. The evaluation consists of synthetic multi-turn conversations between a user and an assistant in which the user asks for a piece of writing about a topic, for example “write a blog post about rocks” or “write a poem about tapirs.” Two, four, or eight identical requests are then inserted into the context, and the model must retrieve the response corresponding to a specific instance (for example, “give me the third poem about tapirs”).
The challenge is that these requests closely resemble the rest of the context; models can easily be misled by subtle differences, such as a poem about frogs instead of tapirs, or a short story about tapirs instead of a poem. GPT-4.1 outperforms GPT-4o at context lengths up to 128K tokens and maintains strong performance even up to 1 million tokens.
The task remains difficult even for advanced reasoning models, however. The eval dataset is being shared to encourage further work on real-world long-context retrieval.
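For concreteness, a toy MRCR-style instance can be assembled as in the sketch below. This is a hypothetical reconstruction based only on the description above, not the format of the released dataset:

```python
import random

# Toy reconstruction of an OpenAI-MRCR-style instance: several identical
# requests interleaved with near-duplicate distractors, followed by a question
# that names one specific instance to retrieve.

def make_instance(n_copies: int = 4) -> tuple[list[dict], str]:
    target = "write a poem about tapirs"
    distractors = ["write a poem about frogs", "write a short story about tapirs"]
    conversation = []
    for i in range(n_copies):
        conversation.append({"role": "user", "content": target})
        conversation.append({"role": "assistant", "content": f"<poem about tapirs #{i + 1}>"})
        filler = random.choice(distractors)
        conversation.append({"role": "user", "content": filler})
        conversation.append({"role": "assistant", "content": f"<response to '{filler}'>"})
    question = "Give me the third poem about tapirs, verbatim."
    return conversation, question

conv, question = make_instance()
print(f"{len(conv)} turns; final question: {question}")
```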
Real world examples
Thomson Reuters: Thomson Reuters tested GPT-4.1 with CoCounsel, its professional-grade AI assistant for legal work. Compared to GPT-4o, GPT-4.1 improved multi-document review accuracy by 17% on internal long-context benchmarks, an essential measure of CoCounsel’s ability to handle complex legal workflows involving multiple lengthy documents.
The model proved especially strong at maintaining context across sources and accurately identifying nuanced relationships between documents, such as conflicting clauses or additional supplementary context, tasks critical to legal research and decision-making.
Carlyle: Carlyle used GPT-4.1 to accurately extract granular financial data across multiple lengthy documents, including PDFs, Excel files, and other complex formats. According to its internal evals, GPT-4.1 performed 50% better on retrieval from very large documents with dense data and was the first model to successfully overcome key limitations seen in other available models, including needle-in-the-haystack retrieval, lost-in-the-middle errors, and multi-hop reasoning across documents.
Beyond accuracy and benchmark performance, developers need models that respond quickly in order to keep up with and meet users’ demands. Improvements to the inference stack have shortened time to first token, and prompt caching can reduce latency further while saving money.
In initial testing, time to first token for GPT-4.1 was approximately 15 seconds with 128,000 tokens of context, and up to a minute with a million tokens of context. GPT-4.1 mini and nano are faster; GPT-4.1 nano, for example, most often returns the first token in less than five seconds for queries with 128,000 input tokens.
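Time to first token can be measured directly by streaming the response and timing the first event, as in this sketch (assuming the OpenAI Python SDK; the long repeated input is illustrative):

```python
import time
from openai import OpenAI

client = OpenAI()

# Measure time to first token: stream the response and record how long the
# first event takes to arrive. The repeated input text is illustrative only.
long_context = "background document text " * 30_000

start = time.monotonic()
stream = client.responses.create(
    model="gpt-4.1-nano",
    input=long_context + "\n\nSummarize the document in one sentence.",
    stream=True,
)
for event in stream:
    print(f"first event after {time.monotonic() - start:.2f}s")
    break
```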
Vision
The GPT-4.1 family is exceptionally strong at image understanding, with GPT-4.1 mini in particular marking a significant leap forward, frequently beating GPT-4o on image benchmarks.
Long-context performance is also especially important for multimodal use cases such as processing long videos. In Video-MME (long, no subtitles), a model answers multiple-choice questions based on 30- to 60-minute videos with no subtitles. GPT-4.1 achieves state-of-the-art performance here, scoring 72.0%, up from 65.3% for GPT-4o.
GPT-4.1 model pricing
GPT-4.1, GPT-4.1 mini, and GPT-4.1 nano are available now to all developers.
Efficiency improvements to the inference systems have made lower prices possible on the GPT-4.1 series. GPT-4.1 is 26% cheaper than GPT-4o for median queries, and GPT-4.1 nano is the cheapest and fastest model yet. For queries that repeatedly pass the same context, the prompt caching discount on these new models is increasing from 50% to 75%. Finally, long context requests incur no surcharge beyond the standard per-token rates.
| Model (prices per 1M tokens) | Input | Cached input | Output | Blended pricing |
|---|---|---|---|---|
| gpt-4.1 | $2.00 | $0.50 | $8.00 | $1.84 |
| gpt-4.1-mini | $0.40 | $0.10 | $1.60 | $0.42 |
| gpt-4.1-nano | $0.10 | $0.025 | $0.40 | $0.12 |
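As a worked example of the table above, the cost of a single request can be computed straight from the per-1M-token rates; the token counts below are made up for illustration:

```python
# Worked example: cost of one hypothetical gpt-4.1 request, using the
# per-1M-token prices from the table above. Token counts are illustrative.
PRICES = {"input": 2.00, "cached_input": 0.50, "output": 8.00}  # USD per 1M tokens

def request_cost(input_toks: int, cached_toks: int, output_toks: int) -> float:
    """Bill cached input tokens at the discounted rate, the rest at full rate."""
    return (
        (input_toks - cached_toks) * PRICES["input"]
        + cached_toks * PRICES["cached_input"]
        + output_toks * PRICES["output"]
    ) / 1_000_000

# 100K input tokens, 80K of them cache hits, 2K output tokens -> $0.096
print(f"${request_cost(100_000, 80_000, 2_000):.3f}")
```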
In conclusion
GPT-4.1 marks an important step forward in the practical application of AI. By focusing closely on real-world developer needs, from coding to instruction following to long-context understanding, these models open up new possibilities for building intelligent systems and sophisticated agentic applications.