Run ad9d85fe
https://proof-loop.vercel.app/***
- Mode: Light
- Total trials: 45
- Starter pack: Stagehand-books
- Duration: 4:22
5 prompts × 3 variations × 5 attempts
What we're testing
Stagehand opens books.toscrape.com in a real browser and extracts data. We send 5 prompts and run each one with Sonnet 4.5, Haiku 4.5, and GPT-4o mini, 5 attempts per model. Watch where they break.
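As a rough illustration of how the trial matrix described above multiplies out, the sketch below enumerates every (prompt, model, attempt) combination. It is one reading of the numbers, not the tool's actual scheduler: the prompt labels, the `PENDING` set, and the `build_trials` helper are hypothetical, and it assumes the two prompts shown as "(pending)" are excluded from the reported total.

```python
from itertools import product

MODELS = ["Sonnet 4.5", "Haiku 4.5", "GPT-4o mini"]
ATTEMPTS = 5

# Hypothetical short labels for the five prompts; the last two stand in
# for the prompts the grid lists as "(pending)".
PROMPTS = [
    "recommend-under-20",
    "travel-category",
    "five-star-homepage",
    "pending-4",
    "pending-5",
]
PENDING = {"pending-4", "pending-5"}


def build_trials(prompts, models, attempts):
    """Return one (prompt, model, attempt_number) tuple per trial."""
    return list(product(prompts, models, range(1, attempts + 1)))


planned = build_trials(PROMPTS, MODELS, ATTEMPTS)
# Assumption: trials for pending prompts have not produced results yet.
completed = [trial for trial in planned if trial[0] not in PENDING]

print(len(planned), len(completed))  # 75 45
```

Under these assumptions, the full matrix holds 75 trials, and the 45 reported in the run summary matches the three prompts that have results.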
Trial grid
- Passed
- Partial
- Failed
Recommend a well-reviewed book under £20 that is currently in stock, and extract its title, price, in-stock status and rating.
- Sonnet 4.5: 5 / 5
- Haiku 4.5: 4 / 5
- GPT-4o mini: 5 / 5
Navigate to the 'Travel' category, open any book in it, and extract its title, price, in-stock status and rating.
- Sonnet 4.5: 5 / 5
- Haiku 4.5: 5 / 5
- GPT-4o mini: 4 / 5
Find a book on the homepage with a 5-star rating that is currently in stock, open it, and extract its title, price, in-stock status and rating.
- Sonnet 4.5: 1 / 5
- Haiku 4.5: 3 / 5
- GPT-4o mini: 1 / 5
(pending)
- Sonnet 4.5: 0 / 5
- Haiku 4.5: 0 / 5
- GPT-4o mini: 0 / 5
(pending)
- Sonnet 4.5: 0 / 5
- Haiku 4.5: 0 / 5
- GPT-4o mini: 0 / 5
Reliability
How consistently your agent worked across all retries, and where it broke.
- Consistency: 44% (4 of 9 cases consistent across all reps)
- Pass rate: 73% (33 / 45 trials)
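Both headline figures can be reproduced from the trial grid. The sketch below is a minimal illustration rather than the dashboard's actual implementation. It assumes each cell's `n / 5` is a count of passed attempts, that the two pending prompts are excluded, and that a case (one prompt run on one model) counts as "consistent" when all of its attempts share the same outcome. The `grid` layout, the short prompt labels, and the `summarize` helper are hypothetical.

```python
ATTEMPTS = 5

# Passed attempts per (prompt, model) case, copied from the trial grid.
# Pending prompts are left out (assumption: they do not count yet).
grid = {
    "recommend-under-20": {"Sonnet 4.5": 5, "Haiku 4.5": 4, "GPT-4o mini": 5},
    "travel-category": {"Sonnet 4.5": 5, "Haiku 4.5": 5, "GPT-4o mini": 4},
    "five-star-homepage": {"Sonnet 4.5": 1, "Haiku 4.5": 3, "GPT-4o mini": 1},
}


def summarize(grid, attempts=ATTEMPTS):
    """Compute pass rate and consistency over all completed cases."""
    cases = [passed for models in grid.values() for passed in models.values()]
    trials = len(cases) * attempts
    passed = sum(cases)
    # Assumed definition: every attempt passed, or every attempt failed.
    consistent = sum(1 for p in cases if p in (0, attempts))
    return {
        "cases": len(cases),
        "trials": trials,
        "passed": passed,
        "pass_rate_pct": round(100 * passed / trials),
        "consistent": consistent,
        "consistency_pct": round(100 * consistent / len(cases)),
    }


summary = summarize(grid)
print(summary)
```

With the grid values above, this gives 33 of 45 trials passed (73%) and 4 of 9 cases consistent (44%), matching the reported numbers. The remaining 12 trials agree with the failure count in the summary.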
Highlight failures
12 failures across 2 patterns
GPT-4o mini · 11 attempts failed
"Recommend a well-reviewed book under £20 that is currently in stock, and extract its title, price, in-stock status and rating."
The agent encountered a server error and couldn't retrieve any book information, so it failed to provide the title, price, or stock status you requested.
Haiku 4.5 · 1 attempt failed
"Find a book on the homepage with a 5-star rating that is currently in stock, open it, and extract its title, price, in-stock status and rating."
The agent selected a book with a one-star rating instead of finding one with a five-star rating as instructed.