Location
Work from home
Posted
July 02, 2026
Job Description
This role owns the eval harness and the quality gate from the start of the project. It replaces the old late‑stage “Evals Specialist” model with a standing owner for measurable agent quality.
Key Responsibilities
- Build and maintain the MVP eval harness: golden tasks, exception tasks, scorecard metrics, and regression packs.
- Wire evals into CI so quality regressions fail builds and releases.
- Define and maintain release‑gate thresholds with Product and the Tech Lead.
- Lay the path for later adversarial and drift‑testing expansion without overbuilding MVP scope.
Requirements
- Experience evaluating ML, LLM, or non‑deterministic systems.
- Strong test and benchmark design capability.
- Comfort working with noisy metrics, thresholds, and probabilistic behavior.
- Good scripting and automation skills.
- Uses AI to generate candidate eval cases and failure hypotheses, but never confuses gene...