Here is the first thing to get straight about the OpenAI Realtime API versus Vapi, because it changes how you read everything below: they are not the same kind of thing. One is a raw model with an API in front of it. The other is a platform that sits on top of a model like that, so you build on the platform rather than on the model. OpenAI’s Realtime API is a single speech-to-speech model, meaning one model that hears the caller, works out a reply and speaks back, and you wire it into a phone call yourself. Vapi runs the call for you, hands you a dashboard, brings the telephony, and lets you pick which model goes inside, including, as it happens, OpenAI’s. So this is not strictly either-or. It is build versus buy, with a real chance the honest answer is “Vapi, with OpenAI as the engine”.
Quick map of where this goes. First the build effort, because that is the fork in the road. Then the price, told honestly, because the two charge in completely different units and stacking them side by side without explaining that is how people get misled. Then telephony, then who each one is built for, then the named customers and what they tell you, then control versus convenience, then compliance, then the bit we have not measured yet, and a straight answer at the end.
Build versus buy, the actual fork
Start here, because it decides most of the rest. With the OpenAI Realtime API there is no dashboard, no flow builder, no campaign manager. You get an API and SDKs in Python and JavaScript, and you write the code that connects a phone call to the model. For a developer that is a day or two of work to a first call, and more to make it production-grade. For a non-technical buyer it is a project you need to hire for. You are building the agent.
Vapi is the opposite trade. It is still a developer’s tool (its own pitch is “API-first by design”), but it gives you the scaffolding rather than the bare model. The call hosting, the orchestration, the telephony hooks, the campaign tooling and a console to watch agents in are all already built. You are configuring an agent, not standing one up from a WebSocket. So the labour difference is real and it cuts one way: OpenAI Realtime asks more of you, Vapi asks less.
Now the twist that stops this being a clean fight. Vapi lets you bring your own model, and OpenAI’s models are exactly the kind it is designed to route to. So one perfectly sensible build is a Vapi agent running an OpenAI model underneath, where you get Vapi’s orchestration and OpenAI’s voice quality in one place, and you pay both bills. That is why I keep saying this is partly build-versus-buy and not a straight versus. Hold that thought, because it matters for the verdict.
The price, told honestly
These two do not price the same thing, so the numbers are not directly comparable, and pretending they are would be the first mistake. Read each on its own terms.
Vapi charges in one unit only. It is $0.05 a minute to host the call, and that is the only number Vapi actually sets. The three moving parts of any voice agent, turning speech into text, the AI working out a reply, and turning that reply back into a voice, are billed straight through from whoever you plug in, at their rates, with no Vapi markup when you bring your own keys. The phone line comes from your carrier. So Vapi’s floor is genuinely cheap, and your real number is whatever your chosen parts add on top, which in a normal stack lands somewhere between $0.05 and $0.30 a minute. The headline is honest, but it is the floor, not the bill.
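The shape of that bill can be sketched in a few lines. The $0.05 hosting fee is the figure quoted above; the component rates in the example are made-up placeholders chosen only to show the arithmetic, not quotes from any provider.

```python
VAPI_HOSTING_PER_MIN = 0.05  # the only rate Vapi itself sets, per the text above


def vapi_cost_per_minute(stt=0.0, llm=0.0, tts=0.0, telephony=0.0):
    """All-in per-minute estimate: hosting fee plus pass-through components.

    Every argument is a per-minute rate in dollars charged by the provider
    you plug in. The defaults of zero give the bare floor.
    """
    return VAPI_HOSTING_PER_MIN + stt + llm + tts + telephony


floor = vapi_cost_per_minute()
# Hypothetical component rates, for illustration only.
example = vapi_cost_per_minute(stt=0.01, llm=0.02, tts=0.04, telephony=0.01)
print(f"floor ${floor:.2f}/min, example stack ${example:.2f}/min")
```

Swap in the real rates from the providers you actually choose and the same sum gives your own number.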
OpenAI Realtime prices nothing per minute at all. It bills by audio tokens, and that is genuinely harder to predict than a flat rate, so let me show the workings rather than hand you a single figure that does not really exist.
OpenAI’s pricing page lists the flagship gpt-realtime-2 at $32.00 per 1M audio input tokens ($0.40 per 1M if the input is cached) and $64.00 per 1M audio output tokens. Your text context, the system prompt and the running conversation, is billed separately at $4.00 per 1M input ($0.40 cached) and $24.00 per 1M output. There is also a cheaper gpt-realtime-mini at $10.00 in, $0.30 cached, $20.00 out per 1M audio tokens, for when top quality is not the point.
To turn tokens into minutes you need OpenAI’s encoding rule: caller (input) audio is one token per 100 milliseconds, so 600 tokens a spoken minute, and agent (output) audio is one token per 50 milliseconds, so 1,200 tokens a spoken minute. Take a minute split roughly half caller, half agent. That is about 300 input tokens (about $0.01) plus 600 output tokens (about $0.04), so the raw audio is only around $0.05 a minute.
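For anyone who wants to change the split, here is the same sum as a small sketch. The rates and the tokens-per-minute figures are the ones quoted above; the function name and the assumption that every second of the minute is either caller or agent speech (no silence) are mine.

```python
# Rates and encoding rule as quoted above for gpt-realtime-2.
AUDIO_IN_PER_M = 32.00     # dollars per 1M audio input tokens
AUDIO_OUT_PER_M = 64.00    # dollars per 1M audio output tokens
IN_TOKENS_PER_MIN = 600    # one token per 100 ms of caller audio
OUT_TOKENS_PER_MIN = 1200  # one token per 50 ms of agent audio


def raw_audio_cost_per_minute(caller_share=0.5):
    """Dollar cost of audio tokens for one call minute.

    caller_share is the fraction of the minute the caller is speaking;
    the agent is assumed to speak for the remainder (no silence modelled).
    """
    in_tokens = IN_TOKENS_PER_MIN * caller_share
    out_tokens = OUT_TOKENS_PER_MIN * (1 - caller_share)
    in_cost = in_tokens * AUDIO_IN_PER_M / 1_000_000
    out_cost = out_tokens * AUDIO_OUT_PER_M / 1_000_000
    return in_cost + out_cost


print(f"${raw_audio_cost_per_minute():.3f} per minute")
```

An agent that talks more than it listens pushes the figure up, because output audio costs more per token and uses twice as many tokens per second.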
Here is the honest part, and it is the whole reason the band on the table above is so wide. That $0.05 is the floor, not the bill. The expensive bit is the text context, because your system prompt and the whole conversation so far get re-sent on every turn, and a chatty agent with a long prompt racks those text tokens up fast. An independent breakdown across 11 call profiles landed at roughly $0.18 to $0.46 a minute with no caching, dropping to about $0.05 to $0.10 a minute once prompt caching is switched on. Caching matters a lot here, so if you build on this, build it in from day one. The headline figure on this page is an estimate of $0.18 a minute, not a published rate, and the all-in band runs from $0.07 to $0.48 on purpose, because your prompt design genuinely moves the number.
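To see why re-sent context dominates, here is a rough model of the text-input side only. The two rates are the text prices quoted above. Everything else is an assumption for illustration: the turn count, the prompt size, the tokens added per turn, and the simplification that with caching on, anything sent on an earlier turn is billed at the cached rate. It gives a per-call figure, not a per-minute one, and it ignores output tokens.

```python
TEXT_IN_PER_M = 4.00      # dollars per 1M text input tokens (quoted above)
TEXT_CACHED_PER_M = 0.40  # dollars per 1M cached text input tokens (quoted above)


def text_input_cost(turns, prompt_tokens, tokens_per_turn, cached=False):
    """Text-input cost for a call where the full context is re-sent each turn.

    Simplifying assumption: with caching on, everything sent on a previous
    turn is billed at the cached rate and only the new tokens are full price.
    """
    seen_rate = TEXT_CACHED_PER_M if cached else TEXT_IN_PER_M
    total = 0.0
    seen = 0  # tokens the model has already been sent on earlier turns
    for turn in range(turns):
        # The system prompt is new only on the first turn.
        fresh = tokens_per_turn + (prompt_tokens if turn == 0 else 0)
        total += (seen * seen_rate + fresh * TEXT_IN_PER_M) / 1_000_000
        seen += fresh  # the next turn re-sends everything so far
    return total


# Hypothetical call: 20 turns, a 2,000-token prompt, 100 new tokens per turn.
for cached in (False, True):
    cost = text_input_cost(20, 2000, 100, cached=cached)
    print(f"cached={cached}: ${cost:.4f} of text input")
```

The uncached total grows with the square of the turn count, which is why a long prompt on a chatty call is where the money goes.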
Read what that means rather than racing the two ranges against each other. Vapi gives you a predictable floor and a bill whose shape you understand: a flat platform fee, plus components you chose and can see. OpenAI Realtime gives you a model whose voice is hard to beat, on a bill you have to model rather than read off a page. If a finance person needs to forecast next quarter’s spend, the Vapi shape is easier to defend. If you have an engineer who will tune prompts and turn on caching, the OpenAI number can land low, but it is work to keep it there.
Telephony, the part OpenAI does not give you
A voice agent needs a phone number people can ring or that can dial out. This is where the platform-versus-model split bites hardest.
The OpenAI Realtime API connects over WebRTC, WebSocket or SIP, but it does not give you a phone line. You bring your own telephony, almost always Twilio, and pay Twilio separately, usually around $0.014 a minute on top of the token cost. You also write the glue that bridges the call audio to the model. None of that is exotic; it is the same trade other raw building-block providers ask of you. But it is real work and a second vendor on the bill.
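To give a feel for that glue, here is a minimal sketch of its smallest piece: turning one message from a Twilio Media Stream into the event that appends audio to the model's input buffer. The message and event shapes follow Twilio's Media Streams and OpenAI's Realtime documentation as I understand them, so check both against the current docs. The function name is mine, and the sketch assumes the Realtime session has been configured to accept Twilio's audio format, so no transcoding is shown. The WebSocket handling, the return path and error handling are all left out.

```python
import json


def twilio_media_to_realtime(twilio_message: str):
    """Translate one Twilio Media Streams message into a Realtime API event.

    Returns a JSON string to send on the model's WebSocket, or None for
    Twilio messages that carry no audio (connected, start, stop, mark).
    """
    msg = json.loads(twilio_message)
    if msg.get("event") != "media":
        return None
    # Twilio's payload is already base64 audio and the append event takes
    # base64 too, so the payload passes through without re-encoding.
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": msg["media"]["payload"],
    })


# Illustrative message with a dummy stream id and payload.
sample = json.dumps({
    "event": "media",
    "streamSid": "MZ-example",
    "media": {"payload": "AAAA"},
})
print(twilio_media_to_realtime(sample))
```

A production bridge wraps this in two long-lived WebSocket connections and does the reverse translation for the agent's audio, which is where most of the day or two goes.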
Vapi carries the telephony layer for you. It does SIP trunking, so you can plug in your own phone-number supplier instead of using its numbers. It does warm transfers, handing a live call to a human with the AI’s summary attached. It runs outbound campaigns in bulk, usually called batch calling. The OpenAI Realtime API, being a model, does none of those as a product: no warm transfer, no batch-calling tool, no campaign manager. You would build each of those yourself around the model. Both, for what it is worth, speak MCP (Model Context Protocol), the connection that lets other AI tools trigger and feed calls, so on that one capability they are level.
Who each one is built for
Two clean use-case fits come out of all that, and they sort almost everyone:
- You are a product or engineering team where speech quality is the thing you will not compromise on, and you will write the code. OpenAI Realtime API. You want the tightest speech-to-speech loop you can get, you are comfortable bringing Twilio and building the agent loop, and you would rather own the raw model than rent a layer on top of it. The voice is the reason, and the do-it-yourself assembly is the price.
- You want a phone agent with telephony, orchestration and a dashboard handed to you, on a bill you can forecast. Vapi. You have a developer, but you would rather they configured an agent than built one from a socket up, and you value shipping this month and a predictable per-minute shape over owning every layer. The platform is doing real work for you, and that is most of what the $0.05 floor buys.
The honest test between them is almost a single question. Are you building the agent, or buying the platform that runs it? If you are writing the code that bridges a call to a model and you want the best model in that seat, OpenAI Realtime. If you want the call infrastructure to already exist, Vapi, quite possibly with OpenAI inside it.
The named customers, and what they tell you
This is where I have to be careful, because accuracy is the whole point of this site and the two vendors are not in the same position on references.
Vapi has the names. By TechCrunch’s account, Amazon Ring routes all of its inbound calls through Vapi after evaluating more than forty rival platforms, and Intuit is a named customer too. Neither is the kind of company that picks a vendor lightly. For a buyer nervous about building a phone line on a young category, that kind of due diligence by someone else is genuinely reassuring, and it is a point in Vapi’s favour that has nothing to do with price.
OpenAI Realtime is a different case, and I will not pretend otherwise. OpenAI is a household name and its API sits underneath a great deal of what ships in this category, but our sourced data for the Realtime API does not carry named, attributable customer references the way Vapi’s does, so I am not going to invent any. The honest read is that OpenAI’s credibility comes from the platform’s ubiquity rather than from a named-customer list I can point you to on this page. If a specific reference matters to your buying case, ask OpenAI sales directly, and treat anything unsourced you read elsewhere with suspicion.
Control versus convenience
Strip the labels off and this is the old control-versus-convenience trade, sharpened because one side is a bare model.
The OpenAI Realtime API gives you maximum control over the speech layer and minimum convenience everywhere else. You own the model choice, the prompt, the caching strategy, the telephony bridge and the agent logic, all of it, because all of it is yours to write. That is total control. It is also total responsibility: every part that breaks is a part you built.
Vapi gives you a lot of convenience and still a fair amount of control, just one level up. You do not write the call-hosting layer or the telephony glue; that is done. But you still choose your speech, model and voice providers and see every choice on the bill and in the latency, so you are not locked into a black box either. The thing you give up against raw OpenAI is the very last layer of control over the speech loop, and what you get back is not having to build and maintain the plumbing around it.
There is a lock-in angle here too, and it favours the buyer in both cases more than you might fear. The part that is genuinely yours, the prompt, the call flow, the logic of what the agent says and does, is portable thinking, not proprietary code. With the OpenAI Realtime API you are about as close to the metal as you can get, so there is little platform to be locked into, though you are tied to OpenAI as the model vendor. With Vapi, because you bring your own providers, leaving means replacing the thin hosting layer rather than your whole stack, and any numbers you provisioned through the platform need porting. One practical tip either way: keep your prompts, call flows and test scripts in your own repository from day one, not just in a vendor dashboard, so the design you have refined is never trapped behind a login you might one day cancel.
Compliance and trust
If you are in healthcare, finance or anywhere regulated, this section may decide it, so here are the specifics for each.
The OpenAI Realtime API inherits the OpenAI API platform’s posture, which is solid. OpenAI’s enterprise-privacy and trust pages state SOC 2 Type 2 for the API, the ability to sign a HIPAA Business Associate Agreement, GDPR support with a Data Processing Addendum, and ISO 27001 and 27701. That is enough paperwork for a regulated build. The catch is the same one telephony raised: because you assemble the rest of the agent yourself, end-to-end coverage also leans on your phone-line supplier, almost always Twilio, holding its own certifications. The model layer is covered; the parts you bolt around it are on you.
Vapi offers HIPAA too, but as a paid add-on at $2,000 a month, and switching it on means no logs, recordings or transcripts are kept. Zero Data Retention, which keeps nothing at all, is a separate $1,000 a month. SOC 2 Type II, GDPR and PCI DSS are covered at the platform layer, though SOC 2 sits on the enterprise plan. The same caveat applies: Vapi only runs the call, so end-to-end coverage also leans on your speech, model, voice and phone-line suppliers holding their own certifications. And if you are running OpenAI as the model inside a HIPAA Vapi agent, both vendors’ compliance has to line up, so check that explicitly rather than assuming the platform covers the model layer for you.
Both can clear a regulated bar. The difference in shape is that OpenAI gates compliance behind getting a BAA signed on the API account, while Vapi sells it as a flat monthly bolt-on on top of a stack you also have to keep compliant. Neither is a free pass; get the BAA signed before any protected health data goes near either.
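A flat monthly bolt-on looks very different depending on how many minutes you spread it over, so it is worth doing the division before comparing. The $2,000 figure is the Vapi HIPAA add-on quoted above; the monthly volumes are hypothetical, picked only to show the spread.

```python
def addon_cost_per_minute(monthly_fee, minutes_per_month):
    """Spread a flat monthly fee across the minutes actually used."""
    if minutes_per_month <= 0:
        raise ValueError("minutes_per_month must be positive")
    return monthly_fee / minutes_per_month


HIPAA_MONTHLY = 2000.0  # Vapi HIPAA add-on, as quoted above

# Hypothetical monthly volumes, for illustration only.
for minutes in (5_000, 50_000, 500_000):
    extra = addon_cost_per_minute(HIPAA_MONTHLY, minutes)
    print(f"{minutes:>7,} min/month -> +${extra:.3f} per minute")
```

At low volume the add-on can outweigh the hosting fee itself; at high volume it almost disappears into the per-minute rate.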
What we have not tested yet
Time for the honest limit. The strongest reason to reach for the OpenAI Realtime API is speech quality: the marin and cedar voices on gpt-realtime-2 are about as natural as anything you can buy today, and the tight speech-to-speech loop avoids the robotic seam where a separate transcription step hands off to a model. That is our read from the public information and a first listen, not a measured result. We have not placed our own timed test calls to either OpenAI Realtime or Vapi yet, so you will not find a Voxrater latency figure for either on this page. When the test rig ships, we will run the same scenarios against both and publish p50, p95 and the dates, and if the measured numbers contradict the reputation, the measured numbers win.
The 1 to 10 scores on the vendor pages are an editorial preview too, our provisional read, not yet from blind listening tests or timed calls. They put OpenAI Realtime high on voice quality and lower on ease of use, which matches the do-it-yourself reality, and Vapi ahead on flexibility and value, which matches the platform story. Honest, but provisional, and we will say so until the harness fills in the real numbers.
Three questions that actually decide it
If you want to skip the prose, answer these.
- Are you building the agent, or buying the platform that runs it? Building, with a developer who wants the rawest, best-sounding model, leans OpenAI Realtime. Buying, where you want telephony and a dashboard to already exist, leans Vapi.
- Can you forecast a token-based bill, or do you need a predictable per-minute shape? Comfortable modelling tokens and tuning caching leans OpenAI Realtime. Needing a number finance can defend next quarter leans Vapi.
- Do you actually have to choose? Often not. If you want OpenAI’s voice on infrastructure you did not build, run it as the model inside a Vapi agent and accept paying both bills.
Bottom line
My lean is Vapi for most buyers, and the reasoning is the build effort, not the voice. Pick Vapi when you want a phone agent with the telephony, orchestration, warm transfer, batch calling and dashboard already built, on a per-minute bill whose shape you can forecast. You give up the very last layer of control over the speech loop and you pay a $0.05 floor plus your components, and in return you skip standing the whole thing up from a raw socket.
Pick the OpenAI Realtime API when speech quality is the thing you will not compromise on, you have a developer happy to write the agent loop and bridge Twilio, and you can stomach a token-based bill you model rather than read off a page. The voice is the best reason to choose it. The do-it-yourself assembly and the token pricing are the reasons plenty of teams build on a platform instead.
And if you are torn, remember the third option that this whole page keeps circling back to. Because Vapi can route to OpenAI’s model, you do not have to choose between the voice and the platform. Run Vapi as the platform with an OpenAI model inside it, accept the slightly higher per-minute bill, and you get the orchestration of the buy and most of the voice quality of the build. Then read the full OpenAI Realtime API review and Vapi review for the per-plan detail, and run your own numbers in the cost calculator with your real call volume before you commit to either.