OpenAI Releases Three Realtime Voice Models, Including GPT-Realtime-2 With GPT-5-Class Reasoning

GPT-Realtime-2, GPT-Realtime-Translate, and GPT-Realtime-Whisper are available through the Realtime API, which has left beta and is now generally available.

May 8, 2026 · 4 min read

This article was produced by the AETW editorial team.

OpenAI launched three new audio models through its Realtime API on May 7, 2026. The flagship, GPT-Realtime-2, brings GPT-5-class reasoning to live voice interactions for the first time, while companion models handle real-time translation across 70+ languages and low-latency streaming transcription.

Three models, one release

On May 7, OpenAI shipped three new models through its Realtime API and moved the API itself out of beta into general availability. The trio covers distinct use cases: GPT-Realtime-2 is a full conversational voice agent capable of reasoning and tool use mid-conversation; GPT-Realtime-Translate is a dedicated live speech translation model supporting over 70 input languages and 13 output languages; and GPT-Realtime-Whisper is a streaming transcription model that converts speech to text as the speaker talks, rather than waiting for completed audio chunks.

The GA announcement is notable on its own. Developers who held off building production systems on a beta API now have a stable surface to build against, and all three new models are available immediately through the API and testable in the OpenAI Playground.

What GPT-Realtime-2 actually changes

The central claim around GPT-Realtime-2 is GPT-5-class reasoning applied to live voice. In practical terms, that means the model is built to keep a conversation moving while it reasons through a request, calls tools, handles interruptions, and recovers from corrections - without going silent or stalling. OpenAI also expanded the context window from 32K to 128K tokens, which lets the model carry longer sessions without losing earlier context. That loss of context was a recurring failure mode in previous voice agents.

Developers can control reasoning intensity across five tiers - minimal, low, medium, high, and xhigh - with low as the default to keep latency acceptable for routine queries. The model also supports parallel tool calls and can produce short bridging phrases like 'let me check that' while processing, so users know something is happening. On benchmarks, GPT-Realtime-2 at the high reasoning setting scored 96.6% on Big Bench Audio, up from 81.4% for its predecessor GPT-Realtime-1.5, and 48.5% on Audio MultiChallenge for multi-turn instruction following, compared to 34.7% previously.

One caveat worth noting: the headline benchmark numbers were run at high and xhigh settings. The default in production is low, so real-world performance for most deployments will sit below those figures. Audio MultiChallenge also sits below 50%, which is a useful indicator that complex multi-turn spoken dialogue remains an open problem - better than before, but not solved.
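To put those two scores in perspective, the sketch below works through the arithmetic using only the figures quoted above. It reports the gain in percentage points and the share of the predecessor's errors that the new model removes. The function name and dictionary layout are illustrative choices, not part of any OpenAI tooling.

```python
# Scores quoted in the article, in percent correct:
# (GPT-Realtime-1.5, GPT-Realtime-2 at the high reasoning setting)
SCORES = {
    "Big Bench Audio": (81.4, 96.6),
    "Audio MultiChallenge": (34.7, 48.5),
}


def summarize(old: float, new: float) -> dict:
    """Return the point gain, remaining error, and share of old errors removed."""
    old_error = 100 - old
    new_error = 100 - new
    return {
        "point_gain": round(new - old, 1),
        "remaining_error_pct": round(new_error, 1),
        "error_reduction_pct": round((old_error - new_error) / old_error * 100, 1),
    }


for name, (old, new) in SCORES.items():
    print(name, summarize(old, new))
```

On these numbers, Big Bench Audio has little headroom left, with about 3.4% of items still failing, while more than half of Audio MultiChallenge items still fail. That is consistent with the caveat that multi-turn spoken dialogue remains unsolved.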

Translation and transcription as standalone infrastructure

GPT-Realtime-Translate and GPT-Realtime-Whisper are not supporting acts - they are purpose-built pipes for specific jobs. Realtime-Translate handles live speech conversion between languages, preserving meaning while keeping pace with the speaker across regional accents and specialized vocabulary. BolnaAI, an early tester, reported 12.5% lower word error rates for Hindi, Tamil, and Telugu compared to other models. Deutsche Telekom is using the model for customer support, letting callers speak in their preferred language while the system handles translation on both ends.

Realtime-Whisper, meanwhile, is purely transcription - speech in, text out, no reasoning or voice response. It transcribes as the speaker talks rather than waiting for a complete audio segment, which makes it relevant for live captions, meeting notes, and voice agents that need continuous awareness of the user rather than a turn-by-turn model.

Early adopter numbers

OpenAI shared results from several companies testing GPT-Realtime-2 ahead of the release. Zillow reported a 26-point improvement in call success rate on its hardest adversarial benchmark after prompt optimization, reaching 95% from 69%. Priceline is exploring a voice travel assistant that could manage flight searches, hotel changes, and on-the-ground translation within a single session. Glean reported a 42.9% relative increase in helpfulness over the previous version in internal evaluations for organizational voice interactions. Genspark said its Call for Me agent saw a 26% increase in effective conversation rate and fewer dropped calls after moving to GPT-Realtime-2.

These numbers come from the companies themselves and reflect specific deployment contexts. They indicate the model performs meaningfully better on structured agentic tasks, which is the category most likely to benefit from the expanded context window and parallel tool calling.
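The figures above mix percentage-point gains with relative increases, and the two are easy to confuse. As a minimal illustration, the sketch below converts Zillow's reported result into both forms. The helper names are hypothetical. The Glean and Genspark figures cannot be converted the same way because no baselines were reported.

```python
def point_change(before: float, after: float) -> float:
    """Absolute change in percentage points."""
    return after - before


def relative_change_pct(before: float, after: float) -> float:
    """Change expressed as a percentage of the starting value."""
    return (after - before) / before * 100


# Zillow's reported call success rate, before and after (percent).
zillow_before, zillow_after = 69.0, 95.0

print(f"points:   {point_change(zillow_before, zillow_after):.1f}")
print(f"relative: {relative_change_pct(zillow_before, zillow_after):.1f}%")
```

A 26-point gain from a 69% baseline works out to a relative increase of roughly 38%, so Zillow's number is not directly comparable to Genspark's 26% increase.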

Pricing

GPT-Realtime-2 is priced at $32 per million audio input tokens, with cached input at $0.40 per million, and $64 per million audio output tokens. GPT-Realtime-Translate costs $0.034 per minute. GPT-Realtime-Whisper costs $0.017 per minute. Independent benchmarking from Artificial Analysis puts effective audio pricing at roughly $1.15 per hour for input and $4.61 per hour for output on GPT-Realtime-2.
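As a back-of-envelope check, the per-token and per-hour figures above can be reconciled with simple arithmetic. The sketch below derives the audio token rate implied by the hourly estimates and converts the per-minute prices to hourly ones for comparison. The implied token rates are derived here from the quoted prices, not published by OpenAI or Artificial Analysis, and the function names are illustrative.

```python
# Prices quoted in the article (USD).
INPUT_USD_PER_MILLION = 32.0
OUTPUT_USD_PER_MILLION = 64.0
INPUT_USD_PER_HOUR = 1.15    # Artificial Analysis estimate
OUTPUT_USD_PER_HOUR = 4.61   # Artificial Analysis estimate
TRANSLATE_USD_PER_MINUTE = 0.034
WHISPER_USD_PER_MINUTE = 0.017


def implied_tokens_per_second(usd_per_hour: float, usd_per_million: float) -> float:
    """Token rate at which the per-token price yields the hourly cost."""
    tokens_per_hour = usd_per_hour / usd_per_million * 1_000_000
    return tokens_per_hour / 3600


def per_minute_to_hourly(usd_per_minute: float) -> float:
    """Convert a per-minute price to a per-hour price."""
    return usd_per_minute * 60


print(f"input:  {implied_tokens_per_second(INPUT_USD_PER_HOUR, INPUT_USD_PER_MILLION):.1f} tokens/s")
print(f"output: {implied_tokens_per_second(OUTPUT_USD_PER_HOUR, OUTPUT_USD_PER_MILLION):.1f} tokens/s")
print(f"translate: ${per_minute_to_hourly(TRANSLATE_USD_PER_MINUTE):.2f}/hour")
print(f"whisper:   ${per_minute_to_hourly(WHISPER_USD_PER_MINUTE):.2f}/hour")
```

Under these assumptions, the hourly estimates imply roughly 10 audio tokens per second on input and 20 on output. The per-minute models come to about $2.04 per hour for Realtime-Translate and $1.02 per hour for Realtime-Whisper.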

The open question

OpenAI framed this release around three interaction patterns - voice-to-action (agent completes a task), systems-to-voice (backend processes surfaced as speech), and voice-to-voice (live cross-language conversation). The framing is a signal about where they believe the market is heading: voice as a primary interface for agentic workflows, not a secondary modality bolted onto a text-first system.

The Realtime API going GA and the simultaneous release of three purpose-built audio models suggest OpenAI is moving to stake out infrastructure in this category before competitors close the gap. Gemini 3.1 Flash Live Preview matches GPT-Realtime-2 on Big Bench Audio at the same 96.6% score, and Grok Voice Agent leads on latency at 0.78 seconds to first audio versus 2.33 seconds for GPT-Realtime-2 at high reasoning. The reasoning quality gains are real, but the competitive field in voice AI is less differentiated than it might appear from a single benchmark headline.

Brian Weerasinghe

AI & Technology Researcher

Brian Weerasinghe is the founder and editor of AI Eating The World, where he covers artificial intelligence, tech companies, layoffs, startups, and the future of work. His reporting focuses on how AI is transforming businesses, products, and the global workforce. He writes about major developments across the AI industry, from enterprise adoption and funding trends to the real-world impact of automation and emerging technologies.
