OpenAI speech-to-text and OpenAI Text to Speech Languages

March 21, 2025


Introducing next-generation audio models in the API: OpenAI speech-to-text and text-to-speech

Developers around the world can now access a new suite of audio models for powering voice agents.

Over the last few months, OpenAI has invested in improving the intelligence, capabilities, and utility of text-based agents, that is, systems that autonomously complete tasks on behalf of users, with releases like Operator, Deep Research, Computer-Using Agents, and the Responses API with built-in tools. But for agents to be truly useful, people must be able to interact with them in deeper and more intuitive ways than text alone, that is, by communicating with them effectively in natural spoken language.

Today OpenAI is introducing new speech-to-text and text-to-speech audio models in the API, enabling developers to build more powerful, customisable, and intelligent voice agents that deliver tangible benefits. Its latest speech-to-text models set a new state-of-the-art benchmark, surpassing existing solutions in accuracy and reliability, particularly in difficult conditions involving accents, noisy surroundings, and varying speech rates. These improvements raise transcription reliability, which makes the models especially well suited to use cases such as customer call centres and meeting note transcription.

For the first time, developers can also instruct the text-to-speech model to speak in a specific way, for example “talk like a sympathetic customer service agent”, which opens up a new degree of customisation for voice agents. This enables a wide range of tailored applications, from more empathetic and dynamic customer service voices to expressive narration for creative storytelling experiences.

OpenAI released its first audio model in 2022 and has since committed to improving the intelligence, accuracy, and reliability of these models. With the new audio models, developers can build more accurate and robust speech-to-text systems and expressive, characterful text-to-speech voices, all within the API.

Additional information about OpenAI’s most recent audio models

Novel models for speech-to-text

OpenAI is introducing new gpt-4o-transcribe and gpt-4o-mini-transcribe models, which improve on the original Whisper models in word error rate, language recognition, and accuracy.

Across several established benchmarks, gpt-4o-transcribe achieves a lower Word Error Rate (WER) than existing Whisper models, a substantial advance in speech-to-text technology. These improvements stem directly from targeted innovations in reinforcement learning and extensive midtraining on diverse, high-quality audio datasets.
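For readers unfamiliar with the metric, WER is conventionally computed as (substitutions + deletions + insertions) divided by the number of words in the reference transcript, so lower is better. The sketch below is a minimal illustration of that definition and not OpenAI's evaluation code: the function name and the simple lowercase, whitespace-based tokenisation are assumptions, and published benchmarks usually apply heavier text normalisation.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    # Assumption: lowercase + whitespace split stands in for real text normalisation.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution (sat -> sit) and one deletion (the) over six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because insertions are counted, WER can exceed 1.0 when the hypothesis is much longer than the reference.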

As a result, the new speech-to-text models capture the nuances of speech better, reduce misrecognitions, and improve transcription reliability, particularly in difficult situations involving accents, noisy environments, and varying speech speeds. These models are available now in the speech-to-text API.
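As a rough illustration of calling one of these models, the sketch below uses the transcription endpoint of the official `openai` Python SDK. The helper name `transcribe`, the optional `client` parameter, and the file path in the usage note are assumptions made for this example; the article itself only names the models.

```python
def transcribe(audio_path, client=None, model="gpt-4o-transcribe"):
    """Send an audio file to the speech-to-text API and return the transcript text."""
    if client is None:
        # Imported lazily so this module loads even without the SDK installed.
        from openai import OpenAI
        client = OpenAI()  # reads OPENAI_API_KEY from the environment

    with open(audio_path, "rb") as audio_file:
        result = client.audio.transcriptions.create(model=model, file=audio_file)
    return result.text
```

A call such as `transcribe("meeting.wav")` (hypothetical file name) would return the transcript as a string; passing `model="gpt-4o-mini-transcribe"` selects the smaller model.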

OpenAI Text to Speech Languages

OpenAI is also introducing a new gpt-4o-mini-tts model with improved steerability. Developers can now instruct the model not only on what to say but also on how to say it, allowing more tailored experiences for use cases ranging from customer support to creative storytelling. The model is available in the text-to-speech API.
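The split between "what to say" and "how to say it" maps onto two separate request fields in the `openai` Python SDK: `input` for the text and `instructions` for the speaking style. The following is a hedged sketch, not an official recipe: the helper names, the `coral` voice choice, and the output path are assumptions introduced for illustration.

```python
def build_speech_request(text, instructions, voice="coral"):
    """Assemble keyword arguments for a steerable text-to-speech request."""
    return {
        "model": "gpt-4o-mini-tts",
        "voice": voice,                # assumption: any supported voice name works here
        "input": text,                 # what to say
        "instructions": instructions,  # how to say it
    }


def synthesize(text, instructions, out_path, client=None):
    """Stream synthesised speech to a file at out_path."""
    if client is None:
        # Imported lazily so this module loads even without the SDK installed.
        from openai import OpenAI
        client = OpenAI()  # reads OPENAI_API_KEY from the environment

    request = build_speech_request(text, instructions)
    with client.audio.speech.with_streaming_response.create(**request) as response:
        response.stream_to_file(out_path)
```

For example, `synthesize("Your refund is on its way.", "Talk like a sympathetic customer service agent.", "reply.mp3")` would write the spoken reply to a hypothetical `reply.mp3`.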

The models’ technological advancements

Utilising real audio datasets for pretraining

OpenAI’s new audio models build on the GPT-4o and GPT-4o-mini architectures and are extensively pretrained on specialised audio-centric datasets, which is critical for optimising model performance. This targeted approach provides a deeper understanding of speech nuances and enables strong performance across audio-related tasks.

Sophisticated techniques for distillation

OpenAI has improved its distillation methods, enabling knowledge transfer from its largest audio models to smaller, more efficient models. Using advanced self-play methodologies, its distillation datasets capture realistic conversational dynamics and so replicate genuine user-assistant interactions. This helps OpenAI’s smaller models deliver excellent conversational quality and responsiveness.

The paradigm of reinforcement learning

For its speech-to-text models, OpenAI adopted a reinforcement learning (RL)-heavy paradigm that pushes transcription accuracy to state-of-the-art levels. This approach significantly improves precision and reduces hallucination, making its speech-to-text solutions exceptionally competitive in difficult speech recognition scenarios.

These advances mark a step forward in audio modelling, combining innovative techniques with practical enhancements that improve the performance of voice applications.

Availability of APIs

These new audio models are available to all developers now, and OpenAI’s documentation covers building with audio in more detail. For developers who already build conversational experiences with text-based models, adding the speech-to-text and text-to-speech models is the simplest way to create a voice agent. To simplify this development process, OpenAI offers an integration with the Agents SDK. For low-latency speech-to-speech experiences, it recommends building with the speech-to-speech models in the Realtime API.
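The "add speech on both sides of a text agent" approach amounts to a three-stage chain: transcribe the user's audio, pass the text to the existing agent, then synthesise the reply. The sketch below shows only that control flow with injected callables; it is not the Agents SDK interface, and `run_voice_turn` and its parameter names are hypothetical.

```python
from typing import Callable


def run_voice_turn(
    audio_in: bytes,
    transcribe: Callable[[bytes], str],  # speech-to-text stage
    respond: Callable[[str], str],       # existing text-based agent
    speak: Callable[[str], bytes],       # text-to-speech stage
) -> bytes:
    """Handle one conversational turn: audio in, audio out."""
    user_text = transcribe(audio_in)
    reply_text = respond(user_text)
    return speak(reply_text)


if __name__ == "__main__":
    # Stand-in stages so the flow can be exercised without any API calls.
    audio_out = run_voice_turn(
        b"hello",
        transcribe=lambda audio: audio.decode("utf-8"),
        respond=lambda text: f"You said: {text}",
        speak=lambda text: text.encode("utf-8"),
    )
    print(audio_out)
```

Keeping the stages as plain callables means the text agent stays unchanged, at the cost of the extra latency of three sequential steps, which is the trade-off that motivates the Realtime API for low-latency use.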

What comes next?

Looking ahead, OpenAI plans to keep investing in the intelligence and accuracy of its audio models, and to explore ways for developers to bring their own custom voices to build even more personalised experiences in ways that align with its safety standards. OpenAI also continues to engage with policymakers, researchers, developers, and creatives about the challenges and opportunities that synthetic voices can present.

