Commit db4b00e

feat: add LLM eval suite for Payload conventions and code generation (#15710)
## Overview

The suite tests two complementary things:

- **QA evals** — does the model correctly answer questions about Payload's API and conventions?
- **Codegen evals** — can the model apply a specific change to a real `payload.config.ts` file, producing valid TypeScript with the right outcome?

Codegen evals use a three-step pipeline: `LLM generation` → `TypeScript compilation` → `LLM scoring`.

## Skills Evaluation

Each QA suite runs in two modes to measure the impact of injecting `SKILL.md` as passive context:

| Spec file                       | System prompt                     | Purpose                 |
| ------------------------------- | --------------------------------- | ----------------------- |
| `eval.<suite>.spec.ts`          | `qaWithSkill` — SKILL.md injected | Primary eval            |
| `eval.<suite>.baseline.spec.ts` | `qaNoSkill` — no context doc      | Baseline for comparison |

Both modes use passive context injection (the document goes directly into the `system:` field). There is no tool-call indirection. The delta between the two is a direct measure of what SKILL.md contributes.

> Cache keys include `systemPromptKey`, so `qaWithSkill` and `qaNoSkill` results are always stored as separate entries and never collide.

## Running the evals

```bash
# Run all evals (with skill, high-power model)
pnpm run test:eval

# Run all evals — baseline (no skill context, high-power model)
pnpm run test:eval -- eval.baseline

# Run a specific suite only
pnpm run test:eval -- eval.config
pnpm run test:eval -- eval.conventions

# Force a fresh run, bypassing the result cache
EVAL_NO_CACHE=true pnpm run test:eval

# Run with an interactive HTML report (opens in browser after run)
pnpm run test:eval:report

# Report for a specific suite
pnpm run test:eval:report -- eval.config
```

`OPENAI_API_KEY` must be set in your environment. The `test:eval:report` script generates `test/evals/eval-results/report.html` and serves it locally via Vitest UI. The file is gitignored.
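The two QA modes described above boil down to a single branch. As a minimal sketch (the helper name and base prompt text are assumptions, not the repo's actual code), passive injection just concatenates SKILL.md into the system string:

```typescript
// Hypothetical sketch — names and prompt text are illustrative, not from the repo.
type SystemPromptKey = 'qaNoSkill' | 'qaWithSkill'

const BASE_QA_PROMPT = 'You are an expert on Payload CMS. Answer the question concisely.'

// Passive context injection: the document goes straight into the `system:` field.
// There is no tool the model must call to retrieve it.
function buildQaSystemPrompt(key: SystemPromptKey, skillDoc: string): string {
  return key === 'qaWithSkill' ? `${BASE_QA_PROMPT}\n\n${skillDoc}` : BASE_QA_PROMPT
}
```

Because the only difference between the two modes is this one string, any score delta between `eval.<suite>.spec.ts` and `eval.<suite>.baseline.spec.ts` is attributable to the injected document.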
## Pipelines

### QA Pipeline

```mermaid
flowchart LR
  qaCase["EvalCase"]
  optFixture["fixture"]
  systemPrompt["system prompt\n(qaWithSkill or qaNoSkill)"]
  runEval["runEval"]
  scoreAnswer["scoreAnswer"]
  qaResult["EvalResult"]

  qaCase --> runEval
  optFixture -->|"injected into prompt"| runEval
  systemPrompt --> runEval
  runEval --> scoreAnswer
  scoreAnswer --> qaResult
```

### Codegen Pipeline

```mermaid
flowchart LR
  codegenCase["CodegenEvalCase"]
  fixture["fixture"]
  runCodegenEval["runCodegenEval"]
  tsc["validateConfigTypes"]
  scoreConfigChange["scoreConfigChange"]
  codegenResult["EvalResult"]

  codegenCase --> fixture
  fixture --> runCodegenEval
  runCodegenEval --> tsc
  tsc -->|"valid"| scoreConfigChange
  tsc -->|"invalid"| codegenResult
  scoreConfigChange --> codegenResult
```

> The tsc check is the hard gate — if the generated TypeScript does not compile, the case fails immediately without calling the scorer. This keeps the scorer focused on semantic correctness rather than syntax errors.

> Codegen always uses the `configModify` system prompt regardless of skill variant. Codegen cache keys do not include `systemPromptKey`, so codegen results are shared between `with-skill` and `baseline` runs — this is intentional and correct.
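The hard-gate control flow can be sketched as follows. This is a hypothetical shape (the real runner and its types live in `test/evals`); the point is that the LLM scorer is never invoked on code that fails to compile:

```typescript
// Hypothetical shapes mirroring the pipeline description; not the repo's actual types.
interface EvalResult {
  pass: boolean
  score: number
  reason: string
}

async function runCodegenCase(
  generatedSource: string,
  compile: (src: string) => string[], // returns tsc diagnostics, [] if clean
  score: (src: string) => Promise<EvalResult>, // LLM scorer
): Promise<EvalResult> {
  const diagnostics = compile(generatedSource)
  if (diagnostics.length > 0) {
    // Hard gate: a non-compiling config fails immediately, with no LLM call.
    return { pass: false, score: 0, reason: `tsc failed: ${diagnostics.join('; ')}` }
  }
  // Only compiling TypeScript reaches the semantic scorer.
  return score(generatedSource)
}
```

Skipping the scorer on compile failure also saves the scorer-side tokens for those cases, which is why a tsc-failed result reports no `scorer` usage.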
### Result Caching

```mermaid
flowchart LR
  start["Eval"]
  cacheCheck{"cache hit?"}
  cached["cached EvalResult"]
  run["Run full pipeline"]
  write["eval-results/cache/<hash>.json"]
  done["EvalResult"]

  start --> cacheCheck
  cacheCheck -->|"yes + EVAL_NO_CACHE unset"| cached
  cacheCheck -->|"no or EVAL_NO_CACHE=true"| run
  run --> write
  write --> done
  cached --> done
```

Cache keys include the model ID and (for QA) the `systemPromptKey`, so the following never collide:

- `eval.spec.ts` (gpt-5.2 + qaWithSkill)
- `eval.baseline.spec.ts` (gpt-5.2 + qaNoSkill)
- `eval.low-power.spec.ts` (gpt-4o + qaWithSkill)

## Token Usage Tracking

Every `EvalResult` includes a `usage` object covering all LLM calls for that case:

```jsonc
{
  "result": {
    "pass": true,
    "score": 0.92,
    "usage": {
      "runner": {
        "inputTokens": 3499,
        "cachedInputTokens": 3328,
        "outputTokens": 280,
        "totalTokens": 3779,
      },
      "scorer": {
        "inputTokens": 669,
        "cachedInputTokens": 0,
        "outputTokens": 89,
        "totalTokens": 758,
      },
      "total": {
        "inputTokens": 4168,
        "cachedInputTokens": 3328,
        "outputTokens": 369,
        "totalTokens": 4537,
      },
    },
  },
}
```

- **`runner`** — tokens spent generating the answer or modified config.
- **`scorer`** — tokens spent evaluating the result (consistent across skill variants since the scorer prompt is fixed).
- **`total`** — sum of runner + scorer for full per-case cost.
- **`cachedInputTokens`** — the key signal for skill efficiency.

`qaWithSkill` injects SKILL.md (~3,400 tokens) into every system prompt. Once the API warms the prompt cache, ~95% of those tokens are `cachedInputTokens` (billed at a reduced rate), so the net new tokens per call drop to ~170 — nearly identical to the `qaNoSkill` baseline.

For codegen cases that fail tsc, `scorer` is absent and `total` equals `runner`. Usage is stored in the cache alongside the result, so historical runs retain their token data for cost comparisons across model variants and skill configurations.
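The `total` arithmetic is a plain per-field sum. A minimal sketch (the `Usage` shape mirrors the JSON above; the helper name is an assumption):

```typescript
// Shape mirrors the `usage` JSON above; `totalUsage` is a hypothetical helper name.
interface Usage {
  inputTokens: number
  cachedInputTokens: number
  outputTokens: number
  totalTokens: number
}

// For codegen cases that fail tsc, `scorer` is absent and `total` equals `runner`.
function totalUsage(runner: Usage, scorer?: Usage): Usage {
  if (!scorer) return { ...runner }
  return {
    inputTokens: runner.inputTokens + scorer.inputTokens,
    cachedInputTokens: runner.cachedInputTokens + scorer.cachedInputTokens,
    outputTokens: runner.outputTokens + scorer.outputTokens,
    totalTokens: runner.totalTokens + scorer.totalTokens,
  }
}
```

Applied to the example above: 3499 + 669 = 4168 input tokens and 3779 + 758 = 4537 total tokens, matching the `total` block.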
## Negative Tests

The negative suite tests the evaluation pipeline itself as much as the model:

| Test                     | What it checks                                                                                                                                                |
| ------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Detection (QA)**       | Given a broken config, does the model identify the specific error? Expects ≥ 70% accuracy.                                                                    |
| **Correction (Codegen)** | Given a broken config, does the model fix the error? tsc must pass after correction.                                                                          |
| **Invalid instruction**  | The model is explicitly told to introduce a bad field type. The test passes only if tsc catches the error and the pipeline correctly reports it as a failure. |

The three broken fixtures (`invalid-field-type`, `invalid-access-return`, `missing-beforechange-return`) are shared by both the detection and correction datasets.

## Adding a new eval case

**QA case** — add an entry to the appropriate `datasets/<category>/qa.ts`:

```typescript
{
  input: 'How do you configure Payload to send emails?',
  expected: 'set the email property in buildConfig with an adapter like nodemailerAdapter',
  category: 'config',
}
```

**Codegen case** — create a fixture first, then add the dataset entry:

1. Add `test/evals/fixtures/<category>/codegen/<name>/payload.config.ts` — a minimal but valid config that gives the LLM context for the specific task.
2. Add an entry to `datasets/<category>/codegen.ts`:

```typescript
{
  input: 'Add a text field named "excerpt" to the posts collection.',
  expected: 'text field with name "excerpt" added to posts.fields',
  category: 'collections',
  fixturePath: 'collections/codegen/<name>',
}
```

The cache key for codegen includes the fixture file's **content** (not just its path), so updating a fixture automatically invalidates its cached result.

## Admin

The admin interface for evals provides a way to inspect cached results.
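The content-based codegen cache key described above can be sketched as follows. This is a hypothetical helper, not the repo's actual implementation; the point is that hashing the fixture's source text (rather than its path) makes fixture edits self-invalidating:

```typescript
import { createHash } from 'node:crypto'

// Hypothetical sketch — the real key derivation lives in test/evals.
// Hashing the fixture *content* means editing the fixture changes the key,
// so its stale cached result is simply never looked up again.
function codegenCacheKey(modelId: string, input: string, fixtureSource: string): string {
  return createHash('sha256')
    .update(JSON.stringify({ modelId, input, fixtureSource }))
    .digest('hex')
}
```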
<img width="2318" height="149" alt="image" src="https://github.com/user-attachments/assets/c8c87387-e65f-40b5-8a8f-54701e26a3c7" />

This lets users find improvements and regressions, and better understand model capabilities.

<img width="2343" height="794" alt="image" src="https://github.com/user-attachments/assets/61b41c8c-4802-40c3-a81c-115ed309ae3d" />

## Debugging failed cases

Every failed case writes a JSON file to `eval-results/failed-assertions/<label-slug>/`. For codegen cases this includes the starter config, the LLM-generated config, tsc errors (if any), and the scorer's reasoning. For QA cases it includes the question, expected answer, actual answer, and reasoning.

The generated `.ts` files in `eval-results/<category>/codegen/` show the last LLM output for each fixture and can be opened directly in the editor for manual inspection.

---------

Co-authored-by: Elliot DeNolf <denolfe@gmail.com>
1 parent 0d15620 commit db4b00e

113 files changed

Lines changed: 6813 additions & 23 deletions


.gitignore

Lines changed: 5 additions & 0 deletions
```diff
@@ -358,3 +358,8 @@ payload.db
 
 # Screenshots created by Playwright MCP
 .playwright-mcp
+
+# Vitest HTML report generated by test:eval:report
+test/evals/eval-results/report.html
+# Versioned eval run snapshots (local only — used for run comparison in dashboard)
+test/evals/eval-results/runs/
```

eslint.config.js

Lines changed: 1 addition & 0 deletions
```diff
@@ -48,6 +48,7 @@ export const rootEslintConfig = [
       'packages/**/*.spec.ts',
       'templates/**',
       'examples/**',
+      'packages/drizzle/src/postgres/predefinedMigrations/v2-v3/**',
    ],
  },
  {
```

package.json

Lines changed: 40 additions & 1 deletion
```diff
@@ -120,6 +120,41 @@
     "test:e2e:prod:ci:noturbo": "pnpm prepare-run-test-against-prod:ci && pnpm test:e2e:prod:run:noturbo",
     "test:e2e:prod:run": "pnpm runts ./test/runE2E.ts --prod",
     "test:e2e:prod:run:noturbo": "pnpm runts ./test/runE2E.ts --prod --no-turbo",
+    "test:eval": "cross-env NODE_OPTIONS=\"--no-deprecation --no-experimental-strip-types\" NODE_NO_WARNINGS=1 vitest --run --project eval",
+    "test:eval:baseline": "cross-env EVAL_VARIANT=baseline NODE_OPTIONS=\"--no-deprecation --no-experimental-strip-types\" NODE_NO_WARNINGS=1 vitest --run --project eval",
+    "test:eval:building-plugins:baseline": "cross-env EVAL_VARIANT=baseline NODE_OPTIONS=\"--no-deprecation --no-experimental-strip-types\" NODE_NO_WARNINGS=1 pnpm exec vitest --run --project eval eval.building-plugins.spec",
+    "test:eval:building-plugins:low-power": "cross-env EVAL_VARIANT=low-power NODE_OPTIONS=\"--no-deprecation --no-experimental-strip-types\" NODE_NO_WARNINGS=1 pnpm exec vitest --run --project eval eval.building-plugins.spec",
+    "test:eval:building-plugins:skill": "cross-env NODE_OPTIONS=\"--no-deprecation --no-experimental-strip-types\" NODE_NO_WARNINGS=1 pnpm exec vitest --run --project eval eval.building-plugins.spec",
+    "test:eval:collections:baseline": "cross-env EVAL_VARIANT=baseline NODE_OPTIONS=\"--no-deprecation --no-experimental-strip-types\" NODE_NO_WARNINGS=1 pnpm exec vitest --run --project eval eval.collections.spec",
+    "test:eval:collections:low-power": "cross-env EVAL_VARIANT=low-power NODE_OPTIONS=\"--no-deprecation --no-experimental-strip-types\" NODE_NO_WARNINGS=1 pnpm exec vitest --run --project eval eval.collections.spec",
+    "test:eval:collections:skill": "cross-env NODE_OPTIONS=\"--no-deprecation --no-experimental-strip-types\" NODE_NO_WARNINGS=1 pnpm exec vitest --run --project eval eval.collections.spec",
+    "test:eval:config:baseline": "cross-env EVAL_VARIANT=baseline NODE_OPTIONS=\"--no-deprecation --no-experimental-strip-types\" NODE_NO_WARNINGS=1 pnpm exec vitest --run --project eval eval.config.spec",
+    "test:eval:config:low-power": "cross-env EVAL_VARIANT=low-power NODE_OPTIONS=\"--no-deprecation --no-experimental-strip-types\" NODE_NO_WARNINGS=1 pnpm exec vitest --run --project eval eval.config.spec",
+    "test:eval:config:skill": "cross-env NODE_OPTIONS=\"--no-deprecation --no-experimental-strip-types\" NODE_NO_WARNINGS=1 pnpm exec vitest --run --project eval eval.config.spec",
+    "test:eval:conventions:baseline": "cross-env EVAL_VARIANT=baseline NODE_OPTIONS=\"--no-deprecation --no-experimental-strip-types\" NODE_NO_WARNINGS=1 pnpm exec vitest --run --project eval eval.conventions.spec",
+    "test:eval:conventions:low-power": "cross-env EVAL_VARIANT=low-power NODE_OPTIONS=\"--no-deprecation --no-experimental-strip-types\" NODE_NO_WARNINGS=1 pnpm exec vitest --run --project eval eval.conventions.spec",
+    "test:eval:conventions:skill": "cross-env NODE_OPTIONS=\"--no-deprecation --no-experimental-strip-types\" NODE_NO_WARNINGS=1 pnpm exec vitest --run --project eval eval.conventions.spec",
+    "test:eval:fields:baseline": "cross-env EVAL_VARIANT=baseline NODE_OPTIONS=\"--no-deprecation --no-experimental-strip-types\" NODE_NO_WARNINGS=1 pnpm exec vitest --run --project eval eval.fields.spec",
+    "test:eval:fields:low-power": "cross-env EVAL_VARIANT=low-power NODE_OPTIONS=\"--no-deprecation --no-experimental-strip-types\" NODE_NO_WARNINGS=1 pnpm exec vitest --run --project eval eval.fields.spec",
+    "test:eval:fields:skill": "cross-env NODE_OPTIONS=\"--no-deprecation --no-experimental-strip-types\" NODE_NO_WARNINGS=1 pnpm exec vitest --run --project eval eval.fields.spec",
+    "test:eval:graphql:baseline": "cross-env EVAL_VARIANT=baseline NODE_OPTIONS=\"--no-deprecation --no-experimental-strip-types\" NODE_NO_WARNINGS=1 pnpm exec vitest --run --project eval eval.graphql.spec",
+    "test:eval:graphql:low-power": "cross-env EVAL_VARIANT=low-power NODE_OPTIONS=\"--no-deprecation --no-experimental-strip-types\" NODE_NO_WARNINGS=1 pnpm exec vitest --run --project eval eval.graphql.spec",
+    "test:eval:graphql:skill": "cross-env NODE_OPTIONS=\"--no-deprecation --no-experimental-strip-types\" NODE_NO_WARNINGS=1 pnpm exec vitest --run --project eval eval.graphql.spec",
+    "test:eval:local-api:baseline": "cross-env EVAL_VARIANT=baseline NODE_OPTIONS=\"--no-deprecation --no-experimental-strip-types\" NODE_NO_WARNINGS=1 pnpm exec vitest --run --project eval eval.local-api.spec",
+    "test:eval:local-api:low-power": "cross-env EVAL_VARIANT=low-power NODE_OPTIONS=\"--no-deprecation --no-experimental-strip-types\" NODE_NO_WARNINGS=1 pnpm exec vitest --run --project eval eval.local-api.spec",
+    "test:eval:local-api:skill": "cross-env NODE_OPTIONS=\"--no-deprecation --no-experimental-strip-types\" NODE_NO_WARNINGS=1 pnpm exec vitest --run --project eval eval.local-api.spec",
+    "test:eval:low-power": "cross-env EVAL_VARIANT=low-power NODE_OPTIONS=\"--no-deprecation --no-experimental-strip-types\" NODE_NO_WARNINGS=1 vitest --run --project eval",
+    "test:eval:negative:baseline": "cross-env EVAL_VARIANT=baseline NODE_OPTIONS=\"--no-deprecation --no-experimental-strip-types\" NODE_NO_WARNINGS=1 pnpm exec vitest --run --project eval eval.negative.spec",
+    "test:eval:negative:low-power": "cross-env EVAL_VARIANT=low-power NODE_OPTIONS=\"--no-deprecation --no-experimental-strip-types\" NODE_NO_WARNINGS=1 pnpm exec vitest --run --project eval eval.negative.spec",
+    "test:eval:negative:skill": "cross-env NODE_OPTIONS=\"--no-deprecation --no-experimental-strip-types\" NODE_NO_WARNINGS=1 pnpm exec vitest --run --project eval eval.negative.spec",
+    "test:eval:official-plugins:baseline": "cross-env EVAL_VARIANT=baseline NODE_OPTIONS=\"--no-deprecation --no-experimental-strip-types\" NODE_NO_WARNINGS=1 pnpm exec vitest --run --project eval eval.official-plugins.spec",
+    "test:eval:official-plugins:low-power": "cross-env EVAL_VARIANT=low-power NODE_OPTIONS=\"--no-deprecation --no-experimental-strip-types\" NODE_NO_WARNINGS=1 pnpm exec vitest --run --project eval eval.official-plugins.spec",
+    "test:eval:official-plugins:skill": "cross-env NODE_OPTIONS=\"--no-deprecation --no-experimental-strip-types\" NODE_NO_WARNINGS=1 pnpm exec vitest --run --project eval eval.official-plugins.spec",
+    "test:eval:report": "cross-env NODE_OPTIONS=\"--no-deprecation --no-experimental-strip-types\" NODE_NO_WARNINGS=1 vitest --run --project eval --reporter=default --reporter=html --outputFile.html=test/evals/eval-results/report.html",
+    "test:eval:rest-api:baseline": "cross-env EVAL_VARIANT=baseline NODE_OPTIONS=\"--no-deprecation --no-experimental-strip-types\" NODE_NO_WARNINGS=1 pnpm exec vitest --run --project eval eval.rest-api.spec",
+    "test:eval:rest-api:low-power": "cross-env EVAL_VARIANT=low-power NODE_OPTIONS=\"--no-deprecation --no-experimental-strip-types\" NODE_NO_WARNINGS=1 pnpm exec vitest --run --project eval eval.rest-api.spec",
+    "test:eval:rest-api:skill": "cross-env NODE_OPTIONS=\"--no-deprecation --no-experimental-strip-types\" NODE_NO_WARNINGS=1 pnpm exec vitest --run --project eval eval.rest-api.spec",
+    "test:eval:skill": "cross-env NODE_OPTIONS=\"--no-deprecation --no-experimental-strip-types\" NODE_NO_WARNINGS=1 vitest --run --project eval",
     "test:int": "cross-env NODE_OPTIONS=\"--no-deprecation --no-experimental-strip-types\" NODE_NO_WARNINGS=1 DISABLE_LOGGING=true vitest --project int",
     "test:int:firestore": "cross-env NODE_OPTIONS=\"--no-deprecation --no-experimental-strip-types\" NODE_NO_WARNINGS=1 PAYLOAD_DATABASE=firestore DISABLE_LOGGING=true vitest --project int",
     "test:int:postgres": "cross-env NODE_OPTIONS=\"--no-deprecation --no-experimental-strip-types\" NODE_NO_WARNINGS=1 PAYLOAD_DATABASE=postgres DISABLE_LOGGING=true vitest --project int",
@@ -141,6 +176,7 @@
     "README.md": "sh -c 'cp ./README.md ./packages/payload/README.md'"
   },
   "devDependencies": {
+    "@ai-sdk/openai": "3.0.30",
     "@axe-core/playwright": "4.11.0",
     "@libsql/client": "0.14.0",
     "@next/bundle-analyzer": "16.2.1",
@@ -161,6 +197,8 @@
     "@types/react": "19.2.9",
     "@types/react-dom": "19.2.3",
     "@types/shelljs": "0.8.15",
+    "@vitest/ui": "4.0.15",
+    "ai": "6.0.95",
     "axe-core": "4.11.0",
     "chalk": "^4.1.2",
     "comment-json": "^4.2.3",
@@ -203,7 +241,8 @@
     "turbo": "^2.5.4",
     "typescript": "5.7.3",
     "vitest": "4.0.15",
-    "wrangler": "~4.61.1"
+    "wrangler": "~4.61.1",
+    "zod": "4.3.6"
   },
   "packageManager": "pnpm@10.27.0",
   "engines": {
```
