Commit db4b00e
feat: add LLM eval suite for Payload conventions and code generation (#15710)
## Overview
The suite tests two complementary things:
- **QA evals** — does the model correctly answer questions about
Payload's API and conventions?
- **Codegen evals** — can the model apply a specific change to a real
`payload.config.ts` file, producing valid TypeScript with the right
outcome?
Codegen evals use a three-step pipeline: `LLM generation` → `TypeScript
compilation` → `LLM scoring`.
## Skills Evaluation
Each QA suite runs in two modes to measure the impact of injecting
`SKILL.md` as passive context:
| Spec file                       | System prompt                     | Purpose                 |
| ------------------------------- | --------------------------------- | ----------------------- |
| `eval.<suite>.spec.ts`          | `qaWithSkill` — SKILL.md injected | Primary eval            |
| `eval.<suite>.baseline.spec.ts` | `qaNoSkill` — no context doc      | Baseline for comparison |
Both modes use passive context injection: the document goes directly
into the `system:` field, with no tool-call indirection. The delta
between the two is a direct measure of what SKILL.md contributes.
> Cache keys include `systemPromptKey`, so `qaWithSkill` and `qaNoSkill`
results are always stored as separate entries and never collide.
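The injection described above can be sketched in a few lines. This is an illustrative TypeScript sketch, not the suite's actual code: `buildSystemPrompt` and `SKILL_MD` are invented names standing in for whatever the runner really uses.

```typescript
// Stand-in for the real SKILL.md document (assumption: it is plain markdown).
const SKILL_MD = '# Payload conventions\n...'

type SystemPromptKey = 'qaWithSkill' | 'qaNoSkill'

// Passive context injection: the skill document is concatenated directly into
// the system prompt. There is no tool call the model must make to fetch it.
function buildSystemPrompt(key: SystemPromptKey): string {
  const base = 'You are an expert on the Payload CMS API and its conventions.'
  return key === 'qaWithSkill' ? `${base}\n\n${SKILL_MD}` : base
}
```

Because the two prompts differ only by the appended document, any score delta between the modes is attributable to SKILL.md itself.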
## Running the evals
```bash
# Run all evals (with skill, high-power model)
pnpm run test:eval
# Run all evals — baseline (no skill context, high-power model)
pnpm run test:eval -- eval.baseline
# Run a specific suite only
pnpm run test:eval -- eval.config
pnpm run test:eval -- eval.conventions
# Force a fresh run, bypassing the result cache
EVAL_NO_CACHE=true pnpm run test:eval
# Run with an interactive HTML report (opens in browser after run)
pnpm run test:eval:report
# Report for a specific suite
pnpm run test:eval:report -- eval.config
```
`OPENAI_API_KEY` must be set in your environment.
The `test:eval:report` script generates
`test/evals/eval-results/report.html` and serves it locally via Vitest
UI. The file is gitignored.
## Pipelines
### QA Pipeline
```mermaid
flowchart LR
qaCase["EvalCase"]
optFixture["fixture"]
systemPrompt["system prompt\n(qaWithSkill or qaNoSkill)"]
runEval["runEval"]
scoreAnswer["scoreAnswer"]
qaResult["EvalResult"]
qaCase --> runEval
optFixture -->|"injected into prompt"| runEval
systemPrompt --> runEval
runEval --> scoreAnswer
scoreAnswer --> qaResult
```
### Codegen Pipeline
```mermaid
flowchart LR
codegenCase["CodegenEvalCase"]
fixture["fixture"]
runCodegenEval["runCodegenEval"]
tsc["validateConfigTypes"]
scoreConfigChange["scoreConfigChange"]
codegenResult["EvalResult"]
codegenCase --> fixture
fixture --> runCodegenEval
runCodegenEval --> tsc
tsc -->|"valid"| scoreConfigChange
tsc -->|"invalid"| codegenResult
scoreConfigChange --> codegenResult
```
> The tsc check is the hard gate — if the generated TypeScript does not
compile, the case fails immediately without calling the scorer. This
keeps the scorer focused on semantic correctness rather than syntax
errors.
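The gate described in the note is plain control flow. In this sketch the names `validateConfigTypes` and `scoreConfigChange` are taken from the diagram, but the signatures and bodies are stand-ins (the real pipeline is async; this version is synchronous for clarity):

```typescript
type TscResult = { valid: boolean; errors: string[] }
type ScoredResult = { pass: boolean; score: number; reason: string }

// Hard gate: run the type check first; only compiling output reaches the scorer.
function gateAndScore(
  generatedConfig: string,
  validateConfigTypes: (src: string) => TscResult,
  scoreConfigChange: (src: string) => ScoredResult,
): ScoredResult {
  const tsc = validateConfigTypes(generatedConfig)
  if (!tsc.valid) {
    // Fail immediately; the scorer is never called for non-compiling code.
    return { pass: false, score: 0, reason: `tsc errors: ${tsc.errors.join('; ')}` }
  }
  return scoreConfigChange(generatedConfig)
}
```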
> Codegen always uses the `configModify` system prompt regardless of
skill variant. Codegen cache keys do not include `systemPromptKey`, so
codegen results are shared between `with-skill` and `baseline` runs —
this is intentional and correct.
### Result Caching
```mermaid
flowchart LR
start["Eval"]
cacheCheck{"cache hit?"}
cached["cached EvalResult"]
run["Run full pipeline"]
write["eval-results/cache/<hash>.json"]
done["EvalResult"]
start --> cacheCheck
cacheCheck -->|"yes + EVAL_NO_CACHE unset"| cached
cacheCheck -->|"no or EVAL_NO_CACHE=true"| run
run --> write
write --> done
cached --> done
```
Cache keys include the model ID and (for QA) the `systemPromptKey`, so
the following never collide:
- `eval.spec.ts` (gpt-5.2 + qaWithSkill)
- `eval.baseline.spec.ts` (gpt-5.2 + qaNoSkill)
- `eval.low-power.spec.ts` (gpt-4o + qaWithSkill)
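A minimal sketch of how such collision-free keys could be derived. The function name and field set are assumptions; the real implementation may hash additional material:

```typescript
import { createHash } from 'node:crypto'

// Hash the model ID, the (QA-only) systemPromptKey, and the case input together,
// so with-skill, baseline, and low-power runs land in distinct cache entries.
function cacheKey(parts: { model: string; systemPromptKey?: string; input: string }): string {
  const material = [parts.model, parts.systemPromptKey ?? '', parts.input].join('\u0000')
  return createHash('sha256').update(material).digest('hex')
}
```

Any change to the model or prompt variant changes the digest, which is why cached `qaWithSkill` and `qaNoSkill` results can never collide.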
## Token Usage Tracking
Every `EvalResult` includes a `usage` object covering all LLM calls for
that case:
```jsonc
{
  "result": {
    "pass": true,
    "score": 0.92,
    "usage": {
      "runner": {
        "inputTokens": 3499,
        "cachedInputTokens": 3328,
        "outputTokens": 280,
        "totalTokens": 3779,
      },
      "scorer": {
        "inputTokens": 669,
        "cachedInputTokens": 0,
        "outputTokens": 89,
        "totalTokens": 758,
      },
      "total": {
        "inputTokens": 4168,
        "cachedInputTokens": 3328,
        "outputTokens": 369,
        "totalTokens": 4537,
      },
    },
  },
}
```
- **`runner`** — tokens spent generating the answer or modified config.
- **`scorer`** — tokens spent evaluating the result (consistent across
skill variants since the scorer prompt is fixed).
- **`total`** — sum of runner + scorer for full per-case cost.
- **`cachedInputTokens`** — the key signal for skill efficiency.
`qaWithSkill` injects SKILL.md (~3,400 tokens) into every system prompt.
Once the API warms the prompt cache, ~95% of those tokens are
`cachedInputTokens` (billed at a reduced rate), so the net new tokens
per call drops to ~170 — nearly identical to the `qaNoSkill` baseline.
For codegen cases that fail tsc, `scorer` is absent and `total` equals
`runner`.
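The aggregation rule above is simple to state in code. This sketch assumes the `Usage` shape shown in the JSON example; `totalUsage` is an invented helper name:

```typescript
type Usage = {
  inputTokens: number
  cachedInputTokens: number
  outputTokens: number
  totalTokens: number
}

// `scorer` is optional: codegen cases that fail tsc never reach the scorer,
// in which case the per-case total is just the runner's usage.
function totalUsage(runner: Usage, scorer?: Usage): Usage {
  if (!scorer) return { ...runner }
  return {
    inputTokens: runner.inputTokens + scorer.inputTokens,
    cachedInputTokens: runner.cachedInputTokens + scorer.cachedInputTokens,
    outputTokens: runner.outputTokens + scorer.outputTokens,
    totalTokens: runner.totalTokens + scorer.totalTokens,
  }
}
```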
Usage is stored in the cache alongside the result, so historical runs
retain their token data for cost comparisons across model variants and
skill configurations.
## Negative Tests
The negative suite tests the evaluation pipeline itself as much as the
model:
| Test                     | What it checks |
| ------------------------ | -------------- |
| **Detection (QA)**       | Given a broken config, does the model identify the specific error? Expects ≥ 70% accuracy. |
| **Correction (Codegen)** | Given a broken config, does the model fix the error? tsc must pass after correction. |
| **Invalid instruction**  | The model is explicitly told to introduce a bad field type. The test passes only if tsc catches the error and the pipeline correctly reports it as a failure. |
The three broken fixtures (`invalid-field-type`,
`invalid-access-return`, `missing-beforechange-return`) are shared by
both the detection and correction datasets.
## Adding a new eval case
**QA case** — add an entry to the appropriate
`datasets/<category>/qa.ts`:
```typescript
{
  input: 'How do you configure Payload to send emails?',
  expected: 'set the email property in buildConfig with an adapter like nodemailerAdapter',
  category: 'config',
}
```
**Codegen case** — create a fixture first, then add the dataset entry:
1. Add `test/evals/fixtures/<category>/codegen/<name>/payload.config.ts`
— a minimal but valid config that gives the LLM context for the specific
task.
2. Add an entry to `datasets/<category>/codegen.ts`:
```typescript
{
  input: 'Add a text field named "excerpt" to the posts collection.',
  expected: 'text field with name "excerpt" added to posts.fields',
  category: 'collections',
  fixturePath: 'collections/codegen/<name>',
}
```
The cache key for codegen includes the fixture file's **content** (not
just its path), so updating a fixture automatically invalidates its
cached result.
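A sketch of that content-addressed keying. The helper is hypothetical; only the idea that the fixture's bytes feed the hash comes from the text above:

```typescript
import { createHash } from 'node:crypto'

// Hashing the fixture *content* (not its path) means any edit to the fixture
// file changes the key and transparently invalidates the cached result.
function codegenCacheKey(model: string, input: string, fixtureContent: string): string {
  return createHash('sha256')
    .update([model, input, fixtureContent].join('\u0000'))
    .digest('hex')
}
```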
## Admin
The admin interface for evals provides a view for inspecting cached results.
<img width="2318" height="149" alt="image"
src="https://github.com/user-attachments/assets/c8c87387-e65f-40b5-8a8f-54701e26a3c7"
/>
This makes it easier to spot improvements and regressions and to better
understand model capabilities.
<img width="2343" height="794" alt="image"
src="https://github.com/user-attachments/assets/61b41c8c-4802-40c3-a81c-115ed309ae3d"
/>
## Debugging failed cases
Every failed case writes a JSON file to
`eval-results/failed-assertions/<label-slug>/`. For codegen cases this
includes the starter config, the LLM-generated config, tsc errors (if
any), and the scorer's reasoning. For QA cases it includes the question,
expected answer, actual answer, and reasoning.
The generated `.ts` files in `eval-results/<category>/codegen/` show the
last LLM output for each fixture and can be opened directly in the
editor for manual inspection.
---------
Co-authored-by: Elliot DeNolf <denolfe@gmail.com>
113 files changed
Lines changed: 6813 additions & 23 deletions
File tree
- test/evals
- components
- DashboardInfo
- EvalDashboard
- datasets
- collections
- config
- conventions
- fields
- graphql/collections
- local-api/collections
- negative
- plugins
- official
- rest-api/crud
- fixtures
- collections/codegen
- beforechange-hook
- categories-relationship
- comments-relationships
- media-access-control
- posts-title-content
- config/codegen
- admin-components
- cors-serverurl
- localization
- oninit-admin-user
- seo-plugin
- fields/codegen
- array-images
- blocks-layout
- checkbox-ispublished
- group-seo
- number-price
- select-status
- negative/codegen
- invalid-access-return
- invalid-field-type
- missing-beforechange-return
- plugins
- codegen
- enabled-option
- oninit-logging
- tenant-relationship
- with-timestamps
- official/codegen
- ecommerce
- form-builder
- import-export
- mcp
- multi-tenant
- nested-docs
- redirects
- search
- sentry
- seo
- stripe
- icons
- runner
- scorer
- suites
- utils