Weight: 20% of overall score · How the overall score is calculated
Definition
Oversight Quality measures whether you catch and correct bad outputs at the right frequency. Unlike every other dimension, this is not a "higher is always better" metric. The optimal range is a moderate correction rate — high enough to show active supervision, low enough to show that delegation is working.
Two failure modes exist: passive acceptance (zero corrections, implying all outputs are accepted uncritically) and over-correction (constant redirection, implying either poor initial task framing or misplaced distrust).
How it's measured
Oversight events in each session are detected from two sources:
Keyword-based correction count — human turns are scanned for the following correction signals. Each match counts as one oversight event:
no, wrong, not that, don't, stop, wait, actually, instead, undo, revert, that's not, not right, incorrect, you missed, you forgot
LLM-classified events — an LLM reviewer classifies each human turn as one of: correction, redirection, validation, or pure_input. Corrections and redirections count as oversight events.
The LLM classification takes precedence when available. The correction rate is calculated as:
correction rate = oversight events / total human turns
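The detection and rate calculation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the scorer's actual implementation: the function names are hypothetical, keyword matching is assumed to be whole-word and case-insensitive, a turn is assumed to count as at most one oversight event, and the LLM label is assumed to override the keyword scan turn by turn (with `None` meaning no label is available).

```python
import re
from typing import List, Optional, Sequence

# Correction signals listed in the text above.
CORRECTION_SIGNALS = [
    "no", "wrong", "not that", "don't", "stop", "wait", "actually",
    "instead", "undo", "revert", "that's not", "not right", "incorrect",
    "you missed", "you forgot",
]

# Assumption: whole-word, case-insensitive matching, so "no" does not
# fire on words such as "know" or "not".
_SIGNAL_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(s) for s in CORRECTION_SIGNALS) + r")\b",
    re.IGNORECASE,
)

# Of the four labels named in the text, these two count as oversight events.
OVERSIGHT_LABELS = {"correction", "redirection"}


def is_oversight_event(turn: str, llm_label: Optional[str] = None) -> bool:
    """Use the LLM label when one is available, else fall back to keywords."""
    if llm_label is not None:
        return llm_label in OVERSIGHT_LABELS
    return _SIGNAL_RE.search(turn) is not None


def correction_rate(
    turns: Sequence[str],
    llm_labels: Optional[Sequence[Optional[str]]] = None,
) -> float:
    """correction rate = oversight events / total human turns."""
    if not turns:
        return 0.0
    labels: List[Optional[str]] = (
        list(llm_labels) if llm_labels is not None else [None] * len(turns)
    )
    events = sum(is_oversight_event(t, l) for t, l in zip(turns, labels))
    return events / len(turns)


# Hypothetical session of five human turns.
turns = [
    "Add a retry wrapper",
    "No, use exponential backoff instead",
    "Looks good, continue",
    "I know the tests pass",
    "You forgot the timeout",
]
keyword_only = correction_rate(turns)  # turns 2 and 5 match a signal
labels = ["pure_input", "correction", "validation", None, "pure_input"]
with_llm = correction_rate(turns, labels)  # only turn 2 is labelled a correction
```

In this example the keyword scan flags two of five turns, while the LLM labels reclassify the last turn as plain input and leave one event. That illustrates why letting the classifier take precedence can change the rate.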
This rate is then mapped to a score on an inverted-U curve, which peaks at a score of 10 when the correction rate is 20%:
| Correction rate | Score | Signal |
|---|---|---|
| 0% | 1 | No oversight detected |
| < 5% | 1–4 | Passive: outputs accepted without verification |
| 5–10% | 5–7 | Under-supervising |
| 10–30% | 8–10 | Strong oversight band |
| 20% (peak) | 10 | Optimal: active, calibrated supervision |
| 30–50% | 5–7 | Over-correcting |
| > 50% | 1–4 | Micro-managing: poor delegation or task mismatch |
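The table gives score bands rather than an exact curve, so a faithful sketch returns the band, not a single score. The lookup below is a minimal illustration. The function name `score_band` is hypothetical, and the handling of boundary values (10% and 30% fall in the strong band, 50% in the over-correcting band) is an assumption, since the table's ranges share their endpoints.

```python
from typing import Tuple


def score_band(correction_rate: float) -> Tuple[int, int, str]:
    """Return (min_score, max_score, signal) for a rate given as a fraction.

    Thresholds are compared as fractions (0.30, not 30) to avoid
    floating-point surprises from multiplying by 100.
    """
    if correction_rate < 0:
        raise ValueError("correction_rate must be non-negative")
    if correction_rate == 0:
        return (1, 1, "No oversight detected")
    if correction_rate < 0.05:
        return (1, 4, "Passive")
    if correction_rate < 0.10:
        return (5, 7, "Under-supervising")
    if correction_rate <= 0.30:  # assumption: both endpoints inclusive
        return (8, 10, "Strong oversight band")
    if correction_rate <= 0.50:  # assumption: 50% counts as over-correcting
        return (5, 7, "Over-correcting")
    return (1, 4, "Micro-managing")


low, high, signal = score_band(0.20)  # the stated peak falls in the 8-10 band
```

How a score is interpolated within a band is not specified in the text, so it is left out here.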
What high vs low looks like
High (score 8–10)
- Reviewing Claude's output before accepting it, with targeted corrections when something is wrong
- Catching logical errors, wrong assumptions, or missed edge cases — then redirecting precisely
- Validation turns ("this looks right, continue") count positively
- Correction rate lands in the 10–30% range across sessions
Low: passive (score 1–3, correction rate near 0%)
- Accepting all outputs without review
- No corrections across multiple sessions
- Treating Claude's output as final rather than as a first draft to validate
Low: over-correcting (score 1–4, correction rate > 50%)
- Constant redirections suggesting the task was poorly scoped from the start
- Re-explaining the same requirement multiple times per session
- Using Claude interactively rather than as an autonomous tool
Behavioural patterns in real sessions
Anthropic's work study raises a specific concern about oversight: supervision requires the same coding expertise that delegation may erode over time. Engineers who offload coding to Claude may gradually lose the technical depth needed to evaluate whether Claude's output is correct, a risk that compounds as autonomy increases.
This is why Oversight Quality carries significant weight even as Autonomy Calibration rewards longer uninterrupted runs. The two dimensions create a productive tension: grant autonomy, but verify output. The research found that engineers who maintained high oversight quality alongside high autonomy were the ones whose complexity scores rose fastest over the study period.
The cohort data also shows that 55% of Anthropic engineers use Claude daily for debugging — a task type where output verification is built into the workflow (run the tests, see if the bug is fixed). Debugging sessions tend to produce naturally high Oversight Quality scores because verification is inherent to the task.
How it affects your overall score
Oversight Quality carries 20% of your total score.
A one-point improvement in this dimension adds 0.20 points to your overall score.
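As a worked example of that arithmetic, the sketch below assumes the overall score is a weighted sum of dimension scores, which is what the 0.20-per-point statement implies. The helper name is hypothetical.

```python
OVERSIGHT_WEIGHT = 0.20  # weight stated in the text


def overall_contribution(dimension_score: float,
                         weight: float = OVERSIGHT_WEIGHT) -> float:
    """Points this dimension contributes to the overall score.

    Assumption: the overall score is a weighted sum of dimension scores.
    """
    return dimension_score * weight


# Moving Oversight Quality from 6 to 9 is a three-point improvement.
gain = round(overall_contribution(9) - overall_contribution(6), 2)
```

Under this assumption, a three-point improvement in Oversight Quality raises the overall score by 0.60 points.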
Because the scoring curve is non-monotonic (it peaks at a moderate correction rate rather than at the highest one), this is the one dimension where reducing a behavior, specifically over-correction, can raise your score.
It interacts strongly with Autonomy Calibration (high autonomy only earns its score if oversight quality remains healthy) and Delegation Intelligence (well-chosen tasks are easier to verify).