Quaver | Lovable for Evals and Benchmarks
What I Learned Building 2 AI Benchmarks From Scratch (And Why I Automated the Third)
This project was submitted by Faw Ali. It was an Outstanding Submission for the Technical AI Safety Project Sprint (Jan 2026). Participants worked on these projects for 5 weeks. The text below is an excerpt from the final project.
TL;DR
I built Rideshare-Bench, a benchmark that drops an AI into a simulated gig economy and tests whether it drives safely, earns well, and treats passengers fairly. Claude Sonnet 4.5 earned $1,871 over 12 simulated days, about half of what was possible. It chased surge pricing into crowded zones and kept driving while exhausted, at a 15% accident risk. After building two benchmarks from scratch, I noticed I was writing the same boilerplate every time, so I generalized it into Quaver: you describe a benchmark in plain language, an AI agent generates the full scaffold, and you then run it against models and analyze the results. Private evals matter because public ones get gamed. I’m building Ocarina Labs to turn this into independent pre-deployment safety testing for AI agents.
Full Project
You can view the full project here.


