Introducing the Realtime API
Developers can now build fast speech-to-speech experiences into their applications.
OpenAI is launching the Realtime API in public beta today, enabling all paid developers to build low-latency, multimodal experiences in their applications. Similar to ChatGPT’s Advanced Voice Mode, the Realtime API supports natural speech-to-speech conversations using the six preset voices already supported in the API.
Audio in the Chat Completions API
OpenAI is also adding audio input and output to the Chat Completions API to support use cases that don’t require the low-latency benefits of the Realtime API. With this update, developers can pass any text or audio input to GPT-4o and have it respond with text, audio, or both.
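As a rough sketch of what this looks like in practice, a recorded question can be passed to GPT-4o as base64-encoded audio. This uses the gpt-4o-audio-preview model announced later in this post; the request shape follows the beta documentation and may change, and the file path is illustrative:

```python
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Encode a locally recorded question as base64 WAV.
# "user_question.wav" is a placeholder path for illustration.
with open("user_question.wav", "rb") as f:
    encoded_audio = base64.b64encode(f.read()).decode("utf-8")

completion = client.chat.completions.create(
    model="gpt-4o-audio-preview",
    modalities=["text", "audio"],               # request text plus a spoken reply
    audio={"voice": "alloy", "format": "wav"},  # one of the six preset voices
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Answer the question in this recording."},
            {"type": "input_audio",
             "input_audio": {"data": encoded_audio, "format": "wav"}},
        ],
    }],
)

# Print the text transcript of the spoken answer.
print(completion.choices[0].message.audio.transcript)
```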
Developers already use voice experiences to engage users in a range of applications, such as language apps, educational software, and customer support interfaces. With the Realtime API, and with audio coming soon to the Chat Completions API, developers no longer need to stitch together several models to power these experiences. Instead, they can build natural conversational interactions with a single API call.
How it works
Previously, to create a similar voice assistant experience, developers had to transcribe audio with an automatic speech recognition model like Whisper, pass the text to a text model for reasoning, and then play the model’s output using a text-to-speech model. This approach often resulted in noticeable latency, and emphasis, accent, and emotion were lost along the way.
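In code, that chained approach looked roughly like the sketch below. Model names and file paths are illustrative; each of the three network round trips adds to the lag described above:

```python
from openai import OpenAI

client = OpenAI()

# 1. Speech-to-text: transcribe the user's recorded audio with Whisper.
with open("user_question.wav", "rb") as f:
    transcript = client.audio.transcriptions.create(model="whisper-1", file=f)

# 2. Reasoning: send the transcript to a text model.
chat = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": transcript.text}],
)
answer = chat.choices[0].message.content

# 3. Text-to-speech: synthesize the reply so it can be played back.
speech = client.audio.speech.create(model="tts-1", voice="alloy", input=answer)
speech.write_to_file("assistant_reply.mp3")
```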
With the Chat Completions API, developers can handle the entire process with a single API call, though it remains slower than human conversation. The Realtime API improves on this by streaming audio inputs and outputs directly, enabling more natural conversational experiences. It also handles interruptions automatically, much like ChatGPT’s Advanced Voice Mode.
Under the hood, the Realtime API lets you establish a persistent WebSocket connection to exchange messages with GPT-4o. The API supports function calling, which makes it possible for voice assistants to respond to user requests by triggering actions or pulling in new context. For example, a voice assistant could place an order on behalf of the user or retrieve relevant customer information to personalize its responses.
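A minimal sketch of that flow is shown below, based on the public beta documentation; the WebSocket URL and event names follow the beta and may evolve, and the look_up_order tool is a hypothetical example of function calling:

```python
import asyncio
import json
import os

import websockets  # pip install websockets

URL = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"
HEADERS = {
    "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
    "OpenAI-Beta": "realtime=v1",  # required while the API is in beta
}

async def main():
    # Note: websockets >= 14 renames extra_headers to additional_headers.
    async with websockets.connect(URL, extra_headers=HEADERS) as ws:
        # Configure the session: voice, instructions, and a callable tool.
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "voice": "alloy",
                "instructions": "You are a friendly customer support agent.",
                "tools": [{
                    "type": "function",
                    "name": "look_up_order",  # hypothetical function
                    "description": "Fetch a customer's order status by id.",
                    "parameters": {
                        "type": "object",
                        "properties": {"order_id": {"type": "string"}},
                        "required": ["order_id"],
                    },
                }],
            },
        }))
        # Ask the model to speak first.
        await ws.send(json.dumps({
            "type": "response.create",
            "response": {
                "modalities": ["audio", "text"],
                "instructions": "Greet the caller and ask how you can help.",
            },
        }))
        # Read server events off the socket as they stream in.
        async for message in ws:
            event = json.loads(message)
            if event["type"] == "response.text.delta":
                print(event["delta"], end="", flush=True)
            elif event["type"] == "response.done":
                break

asyncio.run(main())
```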
Pricing and Availability
The Realtime API is available in public beta to all paid developers starting today. Audio in the Realtime API is powered by the new GPT-4o model gpt-4o-realtime-preview.
Audio in the Chat Completions API will be released in the coming weeks as a new model, gpt-4o-audio-preview. With gpt-4o-audio-preview, developers can input text or audio into GPT-4o and receive responses in text, audio, or both.
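For example, requesting a spoken reply to a text prompt might look like the sketch below. The parameter names follow the announced beta and could change before the model ships; the audio arrives base64-encoded alongside the text:

```python
import base64
from openai import OpenAI

client = OpenAI()

completion = client.chat.completions.create(
    model="gpt-4o-audio-preview",
    modalities=["text", "audio"],               # ask for both text and audio back
    audio={"voice": "alloy", "format": "wav"},  # one of the six preset voices
    messages=[{"role": "user", "content": "Briefly explain what a WebSocket is."}],
)

# Decode the base64 audio payload and save it as a playable WAV file.
message = completion.choices[0].message
with open("reply.wav", "wb") as f:
    f.write(base64.b64decode(message.audio.data))
```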
The Realtime API uses both text tokens and audio tokens. Text input is priced at $5 per 1M tokens and text output at $20 per 1M tokens. Audio input is priced at $100 per 1M tokens and audio output at $200 per 1M tokens, which works out to approximately $0.06 per minute of audio input and $0.24 per minute of audio output. Audio in the Chat Completions API will be priced the same.
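For intuition, the per-minute figures follow directly from the token prices. The quick arithmetic below derives the audio tokens per minute implied by the quoted rates (approximate, since actual token counts vary with the audio):

```python
# Price per single audio token, from the quoted per-1M rates.
input_per_token = 100 / 1_000_000    # $0.0001 per audio input token
output_per_token = 200 / 1_000_000   # $0.0002 per audio output token

# Implied tokens per minute of audio at the quoted per-minute prices.
print(0.06 / input_per_token)    # 600.0  -> ~600 input tokens per minute
print(0.24 / output_per_token)   # 1200.0 -> ~1200 output tokens per minute
```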
Safety and privacy
The Realtime API uses multiple layers of safety protections to mitigate the risk of API abuse, including automated monitoring and human review of flagged model inputs and outputs. The Realtime API is built on the same version of GPT-4o that powers ChatGPT’s Advanced Voice Mode, which OpenAI carefully assessed using both automated and human evaluations, including evaluations conducted in accordance with its Preparedness Framework and detailed in the GPT-4o System Card. The Realtime API also leverages the audio safety infrastructure OpenAI built for Advanced Voice Mode, which its testing shows has helped reduce the potential for harm.
OpenAI’s usage policies prohibit repurposing or distributing output from its services to spam, mislead, or otherwise harm others, and it actively monitors for potential abuse. Its policies also require developers to make it clear to users that they are interacting with AI, unless it is obvious from the context.
Before launch, OpenAI tested the Realtime API with its external red teaming network and found that it introduced no high-risk gaps that were not already covered by existing mitigations. Like all of its API services, the Realtime API is subject to OpenAI’s Enterprise privacy commitments: it does not train its models on the inputs or outputs used in this service without explicit permission.
Getting Started
Over the coming days, developers can start building with the Realtime API in the Playground, or by using its documentation and the reference client.
OpenAI worked with LiveKit and Agora to create client libraries of audio components like echo cancellation, reconnection, and sound isolation, and integrated the Realtime API with Twilio’s Voice APIs, which let developers build, deploy, and connect AI virtual agents to customers via voice calls.
Next up
OpenAI is actively collecting feedback to improve the Realtime API as it moves toward general availability. Capabilities it plans to add include:
Additional modalities: The Realtime API will support voice to start, and OpenAI plans to add additional modalities like vision and video over time.
Increased rate limits: Today the API is limited to approximately 100 simultaneous sessions for Tier 5 developers, with lower limits for Tiers 1-4. OpenAI will increase these limits over time to support larger deployments.
Official SDK support: The OpenAI Python and Node.js SDKs will include Realtime API functionality.
Prompt Caching: OpenAI will add support for Prompt Caching so that previous conversation turns can be reprocessed at a discount.
Increased model compatibility: The Realtime API will also support GPT-4o mini in future versions of that model.
OpenAI is excited to see how developers use these new capabilities to create engaging new audio experiences for their users, in contexts spanning education, customer support, translation, accessibility, and more.