How we test
Our methodology
The point of this directory is one thing: numbers you can trust. So here is exactly how we plan to get them, what we measure, and the things we will not do. If you ever doubt a result, the method documented on this page is the receipt.
What we test
Every platform will face the same five call scenarios, so the comparison is like for like. We will place 30 calls per platform per scenario (150 calls per platform in total), which gives us roughly 300 latency samples to work from. The full scripts will be published when the rig ships; the recordings stay private.
- Outbound appointment booking. The agent rings a mock customer to confirm a dental appointment. The script makes it hold, interrupt, politely refuse, and read the booking back.
- Inbound customer support. A caller has a billing question and only a partial account number. The agent has to look it up, clarify, and either resolve it or escalate.
- Interruption stress. The caller cuts in mid-sentence at three separate points, to test turn-taking and how cleanly the agent stops talking.
- Noisy environment. The booking call again, this time over simulated cafe background noise, to test how well the speech-to-text holds up.
- Accent variation. The booking call again with three speaker accents (Scottish, Indian English, Filipino English), to test recognition accuracy across voices.
How we measure latency
Latency is the wait that makes an agent feel slow. We measure it as the time from the caller finishing a turn to the agent's first audible word, and report three figures so that one unusually good or bad call cannot skew the result:
- p50: the typical experience (half of calls are faster).
- p95: the slow experience (1 call in 20 is worse than this).
- p99: the near-worst case (1 call in 100 is worse than this).
Where a platform exposes it, we split that wait into its parts (speech-to-text, the language model, text-to-speech, and network), so you can see where the time goes. For reference, the industry aims for an end-to-end p90 under about 3.5s and a p99 under about 5s.
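As a rough illustration of how the three figures can be derived from raw samples, here is a minimal Python sketch. It assumes the nearest-rank percentile definition, which this page does not specify, and the function names and sample data are hypothetical.

```python
def percentile(samples_ms, pct):
    """Nearest-rank percentile: the smallest sample with at least
    pct% of all samples at or below it. (Assumed definition.)"""
    if not samples_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples_ms)
    # Integer ceiling of pct * n / 100, avoiding float rounding.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def latency_summary(samples_ms):
    """Return the three reported figures for one set of samples."""
    return {f"p{p}": percentile(samples_ms, p) for p in (50, 95, 99)}


# Hypothetical data: 300 samples running from 401 ms to 700 ms.
samples = [400 + i for i in range(1, 301)]
print(latency_summary(samples))
```

With around 300 samples, the p99 figure rests on only the three slowest samples, so it is the least stable of the three.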
How we score
Two layers, kept apart. The 1–10 scores on the site today are an editorial preview: our provisional read from public information, so the framework is set up before we make a single call. The measured layer at the bottom replaces that opinion with sourced results once the blind test calls run.
The preview scores you see now
Every platform gets four sub-scores from 1 to 10. We publish all four, never just the headline, because the breakdown is where the useful detail sits.
| Sub-score | What a 10 looks like |
|---|---|
| Voice quality | Natural, expressive speech with no robotic patches, even on long sentences. |
| Voice range | A deep voice library with real control over accent, tone and style, plus cloning. |
| Ease of use | A non-developer can build and ship an agent without writing code or fighting the docs. |
| Value for money | Strong capability for the all-in price, with no surprise add-ons on the bill. |
The Overall is not a fifth opinion. It is a transparent weighted average of those four, and we publish the weights rather than hand-setting a number. We weight the same four criteria differently for different jobs, the way a buyer would: for a phone line, ease and value matter most; for narration, the voice itself is the product.
| Sub-score | All-round | Business calls | Video & narration |
|---|---|---|---|
| Voice quality | 30% | 20% | 40% |
| Voice range | 25% | 15% | 35% |
| Ease of use | 20% | 30% | 15% |
| Value for money | 25% | 35% | 10% |
A worked example. ElevenLabs scores 10, 10, 7 and 6 on the four sub-scores. All-round that comes to 8.4; weighted for narration, where the voice is the point, it rises to 9.2; weighted for calls it is 7.7. The homepage tabs use the job-specific weighting, and the full ranking table uses the all-round one.
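The arithmetic behind that worked example can be reproduced with a short Python sketch. The weights come from the table above; the dictionary keys, the function name and the half-up rounding to one decimal place are assumptions made for illustration (the page does not state its rounding rule, though 9.15 becoming 9.2 is consistent with half-up).

```python
from decimal import Decimal, ROUND_HALF_UP

# Weights in whole percent, copied from the table above.
WEIGHTS = {
    "all_round": {"voice_quality": 30, "voice_range": 25,
                  "ease_of_use": 20, "value_for_money": 25},
    "business_calls": {"voice_quality": 20, "voice_range": 15,
                       "ease_of_use": 30, "value_for_money": 35},
    "video_narration": {"voice_quality": 40, "voice_range": 35,
                        "ease_of_use": 15, "value_for_money": 10},
}


def overall(sub_scores, profile):
    """Weighted average of the four sub-scores, to one decimal place."""
    weights = WEIGHTS[profile]
    if sum(weights.values()) != 100:
        raise ValueError("weights must sum to 100%")
    # Integer sum in hundredths of a point, so no float error creeps in.
    total = sum(weights[name] * sub_scores[name] for name in weights)
    # Assumed rounding rule: half-up to one decimal.
    return (Decimal(total) / 100).quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


# The sub-scores from the worked example.
example = {"voice_quality": 10, "voice_range": 10,
           "ease_of_use": 7, "value_for_money": 6}

for profile in WEIGHTS:
    print(profile, overall(example, profile))
```

The narration figure is exactly 9.15 before rounding, which is why the rounding rule matters for that one result.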
What the number, stars and tier mean
One thing to be upfront about: 10 means the best you can get today, not a hypothetical perfect agent. We also only list platforms worth considering in the first place. So real scores sit high and close together, the way a strong, competitive field should. We do not stretch the numbers apart to invent a runaway winner or a loser, because that would be dishonest. Instead we band the score into a plain-English tier and a star gloss, calibrated so a genuinely good platform reads as good:
| Overall | Stars | Tier |
|---|---|---|
| 8.5–10 | 5 | Exceptional |
| 7.5–8.4 | 4.5 | Excellent |
| 6.5–7.4 | 4 | Strong |
| 5.5–6.4 | 3.5 | Capable |
| 4.5–5.4 | 3 | Fair |
| below 4.5 | 2.5 | Limited |
That is why most platforms here land at four stars ("Strong") or four and a half ("Excellent"): they are good, and they are close. Five stars ("Exceptional") is reserved for a genuine category lead, so ElevenLabs earns it for narration (its voice is the standout) but not for calls. Where two platforms share a tier, take that as our honest signal that they are closely matched, and let the use-case tabs, the badges and the one-line reason point you to the right one.
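The banding can be expressed as a simple lookup. This is a minimal Python sketch of the table above, assuming the Overall has already been rounded to one decimal place; the names are hypothetical.

```python
# (lower bound of band, stars, tier), highest band first, from the table above.
TIERS = [
    (8.5, 5.0, "Exceptional"),
    (7.5, 4.5, "Excellent"),
    (6.5, 4.0, "Strong"),
    (5.5, 3.5, "Capable"),
    (4.5, 3.0, "Fair"),
]


def band(overall):
    """Map an Overall score (assumed already rounded to 0.1) to (stars, tier)."""
    for floor, stars, tier in TIERS:
        if overall >= floor:
            return stars, tier
    return 2.5, "Limited"  # anything below 4.5


print(band(9.2))  # the narration score from the worked example
print(band(7.7))  # the business-calls score from the worked example
```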
So the opinion lives in the four sub-scores; the maths that combines them, and the bands above, are fixed and public. We do not sell or hand-set the Overall, and we never wrap these scores in star-rating markup that could pass them off to search engines as verified reviews. The preview scores are our read until the blind test calls run. Sub-scores last reviewed: 31 May 2026.
What the blind test calls will measure
Once we place real calls, the editorial read gives way to measured results on five dimensions, each tied to a recording and the documented method on this page.
| Dimension | What a top score looks like |
|---|---|
| Latency | Fast and consistent across every scenario, not just on a good run. |
| Voice quality | Hard to tell from a person in a blind listen, with no robotic patches on long sentences. |
| Accuracy | Gets account numbers, names and dates right across all three accents, and rarely makes things up. |
| Conversation flow | Handles interruptions naturally, does not talk over you, and recovers when things get ambiguous. |
| Integration robustness | The documented integrations (CRM, telephony, tools) work without workarounds. |
What we will not do
- No marketing scoring. We do not hand out points for the nicest dashboard or the smoothest onboarding. Those are subjective and easy for a vendor to influence.
- No paid scores. A platform cannot buy a better rank. If we ever sell placement, it is a separate, clearly marked slot, and it never moves a ranking (see sponsored placements).
- No quiet retests. If a vendor disputes a result and we re-test, the original result stays on the page next to the new one, with the date it changed.
- No private benchmarks. Every result ties back to a recording (kept privately, played back on challenge) and to the documented method on this page.
How often we re-test
- Every 90 days by default, so prices and performance do not drift out of date.
- On a big change (a new model, a latency claim, a pricing change), we re-test that platform sooner.
- On a dispute, within 14 days, with both results published.
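For readers who want to work out when a result is next due for review, the two fixed intervals above can be sketched in Python. The function and constant names are hypothetical, and the "big change" trigger is left out because the page gives no fixed interval for it.

```python
from datetime import date, timedelta

ROUTINE_INTERVAL = timedelta(days=90)  # default re-test cadence
DISPUTE_WINDOW = timedelta(days=14)    # deadline after a vendor dispute


def next_retest_due(last_tested, dispute_raised=None):
    """Latest date by which a platform should be re-tested."""
    routine = last_tested + ROUTINE_INTERVAL
    if dispute_raised is None:
        return routine
    # A dispute can only bring the deadline forward, never push it back.
    return min(routine, dispute_raised + DISPUTE_WINDOW)


# Hypothetical dates for illustration.
print(next_retest_due(date(2026, 6, 1)))
print(next_retest_due(date(2026, 6, 1), dispute_raised=date(2026, 7, 1)))
```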
What we disclose with every result
So you can check our working, we publish with each test result the test region, the date, the harness version, the number of calls, and the scripts and noise samples used.
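One way to picture that disclosure is as a fixed record attached to every result. The Python sketch below is illustrative only: the class, field names and example values are hypothetical, and it assumes the noise-sample list may be empty for scenarios that use no background noise.

```python
from dataclasses import dataclass, asdict
from datetime import date


@dataclass(frozen=True)
class ResultDisclosure:
    """The items listed above, one record per published test."""
    region: str
    test_date: date
    harness_version: str
    call_count: int
    scripts: tuple[str, ...]
    noise_samples: tuple[str, ...] = ()  # assumed optional


def missing_fields(disclosure):
    """Names of required fields left empty or zero."""
    optional = {"noise_samples"}
    return [name for name, value in asdict(disclosure).items()
            if name not in optional and not value]


# Hypothetical example values.
record = ResultDisclosure(
    region="eu-west",
    test_date=date(2026, 6, 1),
    harness_version="0.1.0",
    call_count=30,
    scripts=("outbound-appointment-booking",),
)
print(missing_fields(record))
```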
Conflict of interest
Voxrater is independent. We hold no equity, no advisory role and no paid affiliation with any platform listed here. Some links are affiliate links, and the full commission detail is published on the affiliate-disclosure page. Affiliate income does not move a ranking; this documented methodology is the audit trail.
Open by default
The scoring rubric is laid out in the open, here on this page, and the test scripts and per-result measurements will join it when the rig ships. Only the recordings stay private, because they contain mock callers and sometimes paid actors, and we keep those to defend a result if it is challenged.