Audit your AI agent before users do.

Proofloop hits your endpoint with the same prompts dozens of times, across multiple models, and surfaces where it breaks — the prompt, the agent's output, and a plain-English explanation of what went wrong.

01 Watch it run

Or skip the wait — see a finished audit.

View a sample report

02 The problem

One run proves nothing.

Your agent works in your demo. In production, the same prompt produces different outputs across reps, across models, and across edge cases you didn't think of. Public benchmarks don't catch this — they test models in isolation, not your endpoint.

Until you've audited your agent systematically, with real prompts, repeated runs, and side-by-side models, you're just hoping. Hope is not a deployment plan.

03 How it works

From endpoint to pass@k in three steps.

No dataset. No YAML. No framework lock-in.

01

Paste your endpoint URL.

POST acme.com/agent

OpenAI-compatible · Anthropic · LangServe · Mastra · Custom

02

Pick a stress test — or try Stagehand.

Stagehand · books

3 prompts · easy / medium / hard

Or describe your test in one line.

03

Get pass@k across models and reps.

pass@5 = 0.91

95% CI 0.78–0.97

Per model · per prompt · with failure clusters
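
For readers who want to see what sits behind a number like pass@5 and its interval, here is a minimal Python sketch. It assumes the standard unbiased pass@k estimator and a Wilson score interval; the page does not say which estimator or interval method Proofloop uses, so the function names `pass_at_k` and `wilson_interval` are illustrative and the outputs may not reproduce the figures shown above exactly.

```python
from math import comb, sqrt
from typing import Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate (assumed estimator, not confirmed by the page).

    Chance that at least one of k samples, drawn without replacement
    from n recorded trials of which c passed, is a pass.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k trials failed.
    return 1.0 - comb(n - c, k) / comb(n, k)


def wilson_interval(passes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a pass rate (z = 1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = passes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return max(0.0, center - half), min(1.0, center + half)
```

With one pass in five trials, `pass_at_k(5, 1, 1)` is about 0.2 while `pass_at_k(5, 1, 5)` is 1.0, which is why a high pass@k alone says little about consistency.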

04 The output

What an actual report looks like.

Pass@k with confidence intervals and clustered failure modes, with the prompt and the agent's output side by side.

stagehand-books

Run 0042 · POST /api/demo-agents/stagehand-books

Reliability

How consistently your agent worked across all retries — and where it broke.

78%

7 of 9 cases consistent across all reps

Pass rate

91%

41 / 45 trials
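
The two headline numbers answer different questions, and a short Python sketch shows how each could be derived from raw trial records. It assumes a "case" is one model and prompt pair, and that a case counts as consistent only when every rep passed; the page does not define either term, so treat both as assumptions. The `summarize` helper and the trial data are made up for illustration, chosen only so the totals have the same shape as the report above.

```python
from collections import defaultdict


def summarize(trials):
    """Aggregate (model, prompt_id, passed) records, one per rep.

    Assumption: a case is a (model, prompt_id) pair, and it is
    'consistent' only if every one of its reps passed.
    """
    by_case = defaultdict(list)
    for model, prompt_id, passed in trials:
        by_case[(model, prompt_id)].append(bool(passed))
    if not by_case:
        raise ValueError("no trials to summarize")
    total = sum(len(reps) for reps in by_case.values())
    passes = sum(sum(reps) for reps in by_case.values())
    consistent = sum(1 for reps in by_case.values() if all(reps))
    return {
        "cases": len(by_case),
        "consistent_cases": consistent,
        "reliability": consistent / len(by_case),
        "trials": total,
        "passes": passes,
        "pass_rate": passes / total,
    }


# Synthetic data: 3 hypothetical models x 3 prompts x 5 reps,
# with two cases that each fail their first two reps.
models = ["model-a", "model-b", "model-c"]
prompts = ["easy", "medium", "hard"]
failures = {("model-a", "hard"): 2, ("model-b", "medium"): 2}
trials = [
    (m, p, rep >= failures.get((m, p), 0))
    for m in models
    for p in prompts
    for rep in range(5)
]
summary = summarize(trials)
```

A high pass rate can sit alongside a lower reliability figure because a single failed rep is enough to mark a whole case as inconsistent.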

The patterns the auditor surfaced

Each one is a way an agent like yours could quietly fail in production.

  • Skipped the £20 price filter and picked an over-budget book

    prompt: Recommend a well-reviewed book under £20 that is currently in stock…

    Triggered when Haiku 4.5 returned 'Tipping the Velvet' at £53.74, ignoring the under-£20 constraint on 4 of 5 attempts.

  • Missed the in-stock check on the rating filter

    prompt: Find a book on the homepage with a 5-star rating that is currently in stock…

    Triggered when GPT-4o-mini extracted the first 5-star match without verifying availability on 3 of 5 attempts.

  • Navigated to the wrong category before extracting

    prompt: Navigate to the 'Travel' category, open any book in it…

    Triggered when Sonnet 4.5 opened the 'Travel' breadcrumb instead of the category page on 1 of 5 attempts.

05 What's next

Single-turn today. Multi-turn next.

Today Proofloop measures single-turn correctness — pass@k across models, repetitions, and prompts. That's the audit that already runs on the demo.

The next chapter is multi-turn: personas that push back, clarify, and interrupt. Your agent is hardest to test when users don't take the first answer¹. We're building toward that.

¹ On multi-turn agent tasks, pass^1 of 61% drops to pass^8 of 25%. Sierra τ-bench, 2025.
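
A score that falls as k grows is the signature of an all-trials-must-pass metric, written pass^k in τ-bench, rather than the at-least-one-pass metric pass@k. The Python sketch below uses the combinatorial estimator comb(c, k) / comb(n, k) for pass^k; the function name `pass_hat_k` and the trial counts are illustrative, not taken from any report.

```python
from math import comb


def pass_hat_k(n: int, c: int, k: int) -> float:
    """pass^k estimate: chance that all k samples, drawn without
    replacement from n recorded trials of which c passed, are passes."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(c, k) is 0 when fewer than k trials passed.
    return comb(c, k) / comb(n, k)


# Illustrative numbers only: 5 passes out of 8 trials on one task.
curve = [pass_hat_k(8, 5, k) for k in range(1, 9)]
```

Here `curve` starts at 0.625 for k = 1 and reaches 0.0 by k = 8, since no set of eight trials can all pass when three of them failed.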

Don't let your users be your test suite.