Audit your AI agent before users do.
Proofloop hits your endpoint with the same prompts dozens of times, across multiple models, and surfaces where it breaks — the prompt, the agent's output, and a plain-English explanation of what went wrong.
Or skip the wait — see a finished audit.
View a sample report
One run proves nothing.
Your agent works in your demo. In production, the same prompt produces different outputs across reps, across models, and across edge cases you didn't think of. Public benchmarks don't catch this — they test models in isolation, not your endpoint.
Until you've audited your agent systematically, with real prompts, repeated runs, and side-by-side models, you're just hoping. Hope is not a deployment plan.
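What "systematically" means can be sketched as a simple loop: every prompt, against every model, repeated several times, with each output checked. This is only an illustration, not Proofloop's implementation. `run_audit`, `Trial`, `fake_agent`, and `contains_prompt` are hypothetical names, and the agent here is a local stub rather than a real endpoint call.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, List


@dataclass
class Trial:
    model: str
    prompt: str
    rep: int
    output: str
    passed: bool


def run_audit(
    call_agent: Callable[[str, str], str],   # hypothetical: (model, prompt) -> output
    check: Callable[[str, str], bool],       # hypothetical: (prompt, output) -> pass/fail
    models: List[str],
    prompts: List[str],
    reps: int,
) -> List[Trial]:
    """Run every (model, prompt) pair `reps` times and record each result."""
    trials = []
    for model, prompt in product(models, prompts):
        for rep in range(reps):
            output = call_agent(model, prompt)
            trials.append(Trial(model, prompt, rep, output, check(prompt, output)))
    return trials


# Stand-ins for a real endpoint and a real grader (assumptions for this sketch).
def fake_agent(model: str, prompt: str) -> str:
    return f"{model} answered: {prompt}"


def contains_prompt(prompt: str, output: str) -> bool:
    return prompt in output


trials = run_audit(
    fake_agent,
    contains_prompt,
    models=["model-a", "model-b", "model-c"],
    prompts=["easy", "medium", "hard"],
    reps=5,
)
print(len(trials))  # 45 trials: 3 models x 3 prompts x 5 reps
```

In a real audit, `call_agent` would POST to your endpoint and `check` would encode what a correct answer looks like for that prompt.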
From endpoint to pass^k in three steps.
No dataset. No YAML. No framework lock-in.
Paste your endpoint URL.
POST acme.com/agent
OpenAI-compatible · Anthropic · LangServe · Mastra · Custom
Pick a stress test — or try Stagehand.
Stagehand · books
3 prompts · easy / medium / hard
Or describe your test in one line.
Get pass@k across models and reps.
pass@5 = 0.91
95% CI 0.78–0.97
Per model · per prompt · with failure clusters
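For readers who want the arithmetic behind a number like this, here is a minimal sketch. It assumes the widely used unbiased pass@k estimator (1 - C(n-c, k) / C(n, k) for n trials with c passes) and a Wilson score interval for a pass proportion. The page does not say which estimator or interval method Proofloop uses, so treat both choices, and the function names, as assumptions.

```python
from math import comb, sqrt


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one of k samples passes) from n trials with c passes."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when k > n - c, so the estimate is 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Hypothetical numbers: one prompt on one model, 5 reps, 1 pass.
print(pass_at_k(n=5, c=1, k=1))
print(pass_at_k(n=5, c=1, k=5))

# Hypothetical numbers: 18 passes out of 20 trials.
low, high = wilson_interval(18, 20)
print(round(low, 2), round(high, 2))
```

With few trials the interval is wide, which is the practical argument for running more reps before trusting a headline score.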
What an actual report looks like.
Pass@k with confidence intervals and clustered failure modes, with the prompt and the agent's output side by side.
Reliability
How consistently your agent worked across all retries — and where it broke.
78%
7 of 9 cases consistent across all reps
Pass rate
91%
41 / 45 trials
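The two headline numbers measure different things, and a short sketch makes the difference concrete. It assumes a "case" is one (model, prompt) pair and that "consistent" means every rep of that case passed; the page does not define either term. The grid below is made-up data chosen to land on the same totals, not the trials behind the sample report.

```python
from typing import Dict, List, Tuple

Case = Tuple[str, str]  # (model, prompt); an assumed definition of "case"


def summarize(grid: Dict[Case, List[bool]]) -> Dict[str, float]:
    """Compute per-trial pass rate and per-case reliability from pass/fail reps."""
    total_trials = sum(len(reps) for reps in grid.values())
    passed_trials = sum(sum(reps) for reps in grid.values())
    # Assumption: a case counts as consistent only if all of its reps passed.
    consistent_cases = sum(all(reps) for reps in grid.values())
    return {
        "pass_rate": passed_trials / total_trials,
        "reliability": consistent_cases / len(grid),
    }


# Illustrative grid: 3 models x 3 prompts x 5 reps, all passing by default.
grid = {
    (model, prompt): [True] * 5
    for model in ("model-a", "model-b", "model-c")
    for prompt in ("easy", "medium", "hard")
}
# Two cases with failures: 3 failed reps in one, 1 failed rep in the other.
grid[("model-a", "hard")] = [True, False, False, False, True]
grid[("model-b", "medium")] = [True, True, False, True, True]

summary = summarize(grid)
print(f"{summary['pass_rate']:.0%}")    # 91% (41 of 45 trials)
print(f"{summary['reliability']:.0%}")  # 78% (7 of 9 cases)
```

A single flaky rep barely moves the pass rate but removes a whole case from the reliability count, which is why the second number is the stricter one.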
The patterns the auditor surfaced
Each one is a way an agent like yours could quietly fail in production.
Skipped the £20 price filter and picked an over-budget book
prompt›Recommend a well-reviewed book under £20 that is currently in stock…
Triggered when Haiku 4.5 returned 'Tipping the Velvet' at £53.74, ignoring the under-£20 constraint on 4 of 5 attempts.
Missed the in-stock check on the rating filter
prompt›Find a book on the homepage with a 5-star rating that is currently in stock…
Triggered when GPT-4o-mini extracted the first 5-star match without verifying availability on 3 of 5 attempts.
Navigated to the wrong category before extracting
prompt›Navigate to the 'Travel' category, open any book in it…
Triggered when Sonnet 4.5 opened the 'Travel' breadcrumb instead of the category page on 1 of 5 attempts.
Single-turn today. Multi-turn next.
Today Proofloop measures single-turn correctness — pass@k across models, repetitions, and prompts. That's the audit that already runs on the demo.
The next chapter is multi-turn: personas that push back, clarify, and interrupt. Your agent is hardest to test when users don't take the first answer¹. We're building toward that.
¹ pass^1 of 61% drops to pass^8 of 25% on multi-turn agent tasks. Sierra τ-bench, 2025.
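A score that falls as k grows is the signature of the "all k trials pass" metric, written pass^k, rather than the "at least one of k passes" metric, which can only rise with k. A minimal sketch of the estimator commonly used for pass^k, C(c, k) / C(n, k) averaged over tasks, follows. The per-task pass counts are invented for illustration and are not τ-bench data; `pass_hat_k` is a hypothetical name.

```python
from math import comb
from typing import List


def pass_hat_k(passes_per_task: List[int], n: int, k: int) -> float:
    """Average over tasks of the chance that k of the n recorded trials all passed."""
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= n")
    # comb(c, k) is 0 when a task has fewer than k passes.
    return sum(comb(c, k) / comb(n, k) for c in passes_per_task) / len(passes_per_task)


# Invented example: 4 tasks, 8 trials each, with 8, 6, 4, and 2 passes.
passes = [8, 6, 4, 2]
for k in (1, 2, 4, 8):
    print(k, round(pass_hat_k(passes, n=8, k=k), 3))

print(pass_hat_k(passes, n=8, k=1))  # 0.625
print(pass_hat_k(passes, n=8, k=8))  # 0.25
```

Only the task that passed every trial still counts at k = 8, so an agent that looks fine on a single attempt can look far weaker once every attempt has to succeed.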