[evals] Grade knowledge-injection mechanisms on output quality by Janpot · Pull Request #4989 · mui/base-ui

Janpot · 2026-06-09T10:59:38Z

Checkpoint of the agent-knowledge eval matrix at test/evals/. Six knowledge-injection mechanisms × five fixtures × Claude sonnet, each producing an App.tsx that gets graded both deterministically and qualitatively.

Note

The matrix was run once per cell to keep cost and runtime down. For conclusive results we'd want N runs per cell plus an LLM-as-judge grading pass; treat the grades below as preliminary. They do match prior expectations though — a workspace AGENTS.md pointing at prose docs is the strongest channel.
Not intended to merge. This branch is an exploration; the results inform what we recommend, but the harness and fixtures stay out of master.

How to read the eval

Each fixture asks the agent to build a small Base UI app under conditions the model was not trained on (a renamed part, a new prop, a brand-new component, prose-only composition). The "mechanism" controls what knowledge channel the sandbox exposes:

- cc-baseline — nothing extra; the model relies on its training and on reading .d.ts files in node_modules.
- cc-bundled-docs — docs/react/components/*.md shipped inside the package; no pointer.
- cc-bundled-docs-agents-md — bundled docs + a workspace AGENTS.md pointing at them.
- cc-bundled-docs-dts — bundled docs + a @packageDocumentation preamble on index.d.ts.
- cc-bundled-docs-skill — bundled docs + a SKILL.md advertised in the workspace.
- cc-inline-dts-agents-md — anatomy JSDoc inlined into the .d.ts files + a workspace AGENTS.md pointing at a package-local AGENTS.md shipped alongside the types.
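As a rough illustration of the list above, the six mechanisms can be modeled as a small lookup table. The `Channel` shape and its field names are hypothetical and are not taken from the harness; whether `cc-inline-dts-agents-md` also ships the bundled markdown is not stated above, so it is assumed to be `false` here.

```typescript
// Hypothetical model of the six mechanisms. Field names are illustrative only.
type Mechanism =
  | 'cc-baseline'
  | 'cc-bundled-docs'
  | 'cc-bundled-docs-agents-md'
  | 'cc-bundled-docs-dts'
  | 'cc-bundled-docs-skill'
  | 'cc-inline-dts-agents-md';

interface Channel {
  // docs/react/components/*.md shipped inside the package
  bundledDocs: boolean;
  // what, if anything, tells the agent where the knowledge lives
  pointer: 'none' | 'workspace AGENTS.md' | '@packageDocumentation' | 'SKILL.md';
  // anatomy JSDoc inlined into the .d.ts files
  inlineDts: boolean;
}

const channels: Record<Mechanism, Channel> = {
  'cc-baseline': { bundledDocs: false, pointer: 'none', inlineDts: false },
  'cc-bundled-docs': { bundledDocs: true, pointer: 'none', inlineDts: false },
  'cc-bundled-docs-agents-md': { bundledDocs: true, pointer: 'workspace AGENTS.md', inlineDts: false },
  'cc-bundled-docs-dts': { bundledDocs: true, pointer: '@packageDocumentation', inlineDts: false },
  'cc-bundled-docs-skill': { bundledDocs: true, pointer: 'SKILL.md', inlineDts: false },
  // Assumption: bundled markdown is not part of this arm.
  'cc-inline-dts-agents-md': { bundledDocs: false, pointer: 'workspace AGENTS.md', inlineDts: true },
};

// Mechanisms that give the agent an explicit pointer of some kind.
function mechanismsWithPointer(): Mechanism[] {
  return (Object.keys(channels) as Mechanism[]).filter(
    (mechanism) => channels[mechanism].pointer !== 'none',
  );
}
```

Laid out this way, `cc-bundled-docs` is the no-pointer control: it differs from the three `cc-bundled-docs-*` arms only in the `pointer` field.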

LLM grading of `App.tsx` outputs

Letter grades reflect how close the output is to idiomatic 1.5 Combobox usage: subpath import, items prop on Root, function-child <List> for filtering, <Portal> layering, correct Empty/List/RecentSearches composition, and the per-fixture API the prompt expects.
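The criteria above are judged qualitatively, but several of them can be approximated with plain text checks. The following is a minimal sketch, assuming the output is available as a source string; `checkIdioms` and `IdiomReport` are hypothetical names, the regular expressions are naive, and this is not the grader used in the harness.

```typescript
// Naive, regex-based approximation of some idiomatic-usage criteria.
// All names here are hypothetical; a real grader would likely parse the TSX.
interface IdiomReport {
  subpathImport: boolean;
  itemsPropOnRoot: boolean;
  functionChildList: boolean;
  usesPortal: boolean;
  avoidsUseFilteredItems: boolean;
}

function checkIdioms(source: string): IdiomReport {
  return {
    // import ... from '@base-ui/react/combobox'
    subpathImport: /from\s+['"]@base-ui\/react\/combobox['"]/.test(source),
    // <Combobox.Root items=...>, optionally with a generic such as <Combobox.Root<Fruit> ...>
    itemsPropOnRoot: /<Combobox\.Root(?:<[^>]*>)?[^>]*\bitems=/.test(source),
    // <Combobox.List>{(item) => ...} or <Combobox.List>{item => ...}, but not {items.map(...)}
    functionChildList: /<Combobox\.List\b[^>]*>\s*\{\s*(?:\(|[A-Za-z_$][\w$]*\s*=>)/.test(source),
    usesPortal: /<Combobox\.Portal\b/.test(source),
    avoidsUseFilteredItems: !/useFilteredItems/.test(source),
  };
}
```

Text checks like these only catch the mechanical parts of the rubric. Judgments such as "reads like a hand-tuned demo" or "clever but overbuilt" still need a human or LLM reviewer.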

`combobox` — basic fruit picker

Sanity check. A vanilla fruit picker using only APIs that exist in the model's training data — measures baseline fluency with no post-cutoff surprises.

| Mechanism | Grade | Notes |
| --- | --- | --- |
| `cc-baseline` | D | Root import; hand-rolls a `FruitList` wrapper around `Combobox.useFilteredItems` (internal hook) and `.map`s items manually. Wrong API on every axis. |
| `cc-bundled-docs` | B− | Function-child `<List>` ✓, generic `Root<Fruit>` ✓, but imports from `@base-ui/react` (no subpath). No `<Portal>`. |
| `cc-bundled-docs-agents-md` | A | Subpath import, `items` prop, function-child, `<Portal>`, `<ItemIndicator>`, `Empty` before `List`. Reads like a hand-tuned demo. |
| `cc-bundled-docs-dts` | A− | Same shape as above; skips `Label` and self-closes `Clear`/`Trigger` with `aria-label`. |
| `cc-bundled-docs-skill` | B | Subpath ✓, `items` ✓, `<Portal>` ✓ — but `.map` inside `<List>` instead of function-child. |
| `cc-inline-dts-agents-md` | B | Same trade-off as `skill`: structure is right, but it `.map`s items rather than using the function-child filtering API. |

`new-prop` — `<Combobox.Clear closeOnClear>`

Tests discovery of a synthetic post-cutoff closeOnClear prop on <Combobox.Clear>. The model can't have seen this prop; it has to find it in the injected channel.

| Mechanism | Grade | Notes |
| --- | --- | --- |
| `cc-baseline` | C | Root import; `.map` inside `<List>`; missing `<Portal>`. Gets `closeOnClear`, presumably from the `.d.ts` types, since the prop is synthetic and cannot come from training. |
| `cc-bundled-docs` | C+ | Wrong import; function-child ✓; over-engineers a custom `filter={…}` that duplicates what the items API already does. |
| `cc-bundled-docs-agents-md` | A | Clean: subpath, function-child, `<Portal>`, `closeOnClear`, `<ItemIndicator>`. |
| `cc-bundled-docs-dts` | B− | Function-child ✓, `closeOnClear` ✓, but imports from root. |
| `cc-bundled-docs-skill` | A− | All the right pieces; self-closing `Clear` is debatable but valid. |
| `cc-inline-dts-agents-md` | B+ | Subpath ✓, `closeOnClear` ✓, `<Portal>` ✓; still `.map`s instead of function-child. |

`breaking-change` — `Combobox.Clear` renamed to `Combobox.Reset`

A synthetic post-cutoff rename: <Combobox.Clear> no longer exists, replaced by <Combobox.Reset>. Tests whether the agent notices the part it remembers from training is gone.

| Mechanism | Grade | Notes |
| --- | --- | --- |
| `cc-baseline` | A− | Picks up `Reset` and function-child, but threads `index` through manually. |
| `cc-bundled-docs` | B | Wrong import; otherwise correct. |
| `cc-bundled-docs-agents-md` | A | Gold standard. |
| `cc-bundled-docs-dts` | C+ | Reaches for `Combobox.useFilter({sensitivity: 'base'})` and passes `filter={filter.contains}`; clever but overbuilt, and `.map`s the list. Reads the types and decides to maximize them. |
| `cc-bundled-docs-skill` | A− | Right shape; self-closing `Reset`/`Trigger` with `aria-label`. |
| `cc-inline-dts-agents-md` | C | Omits the `items` prop (relies on a `Root<Fruit>` generic alone) and `.map`s the static array — the filtering API is bypassed entirely. |

`docs-only-composition` — `<Combobox.RecentSearches>` nested in `<Combobox.Empty>`

A synthetic post-cutoff <Combobox.RecentSearches> part whose composition rule (must nest inside <Combobox.Empty>) is documented in prose only. Not derivable from the .d.ts shape — measures whether the mechanism surfaces docs the types can't.

| Mechanism | Grade | Notes |
| --- | --- | --- |
| `cc-baseline` | B+ | Wrong import, but the composition (`RecentSearches` inside `Empty`, `Empty` outside `List`) is correct. |
| `cc-bundled-docs` | A− | Subpath ✓, function-child ✓, composition ✓; minor: no `<Portal>`. |
| `cc-bundled-docs-agents-md` | A | All correct. |
| `cc-bundled-docs-dts` | D | Goes back to `Combobox.useFilteredItems` with a `FruitItems` wrapper and nests `<Empty>` inside `<List>` — wrong layout. The `.d.ts` preamble doesn't talk about composition, so the agent invents structure. |
| `cc-bundled-docs-skill` | A | Clean composition with `<Portal>`. |
| `cc-inline-dts-agents-md` | A | Clean composition with `<Portal>`. |
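The composition rule graded here (`RecentSearches` inside `Empty`, `Empty` outside `List`) can be approximated by tracking which `Combobox.*` parts are open at each point in the source. The sketch below is a naive tag scanner with hypothetical function names. It assumes no `>` characters appear inside attribute values (so arrow functions in props would confuse it), and it is not the harness's grader.

```typescript
// For every occurrence of <Combobox.{part}>, return the stack of Combobox parts
// that enclose it. Naive: assumes attributes never contain a '>' character.
function ancestorsOf(source: string, part: string): string[][] {
  const tag = /<(\/?)Combobox\.(\w+)\b([^>]*?)(\/?)>/g;
  const stack: string[] = [];
  const found: string[][] = [];
  let match: RegExpExecArray | null;
  while ((match = tag.exec(source)) !== null) {
    const isClosing = match[1] === '/';
    const name = match[2];
    const isSelfClosing = match[4] === '/';
    if (isClosing) {
      // Pop back to the matching opener, if there is one.
      const index = stack.lastIndexOf(name);
      if (index !== -1) stack.length = index;
    } else {
      if (name === part) found.push(stack.slice());
      if (!isSelfClosing) stack.push(name);
    }
  }
  return found;
}

// RecentSearches must appear, and every occurrence must sit inside Empty.
function recentSearchesInsideEmpty(source: string): boolean {
  const occurrences = ancestorsOf(source, 'RecentSearches');
  return occurrences.length > 0 && occurrences.every((a) => a.indexOf('Empty') !== -1);
}

// Empty must never be nested inside List.
function emptyOutsideList(source: string): boolean {
  return ancestorsOf(source, 'Empty').every((a) => a.indexOf('List') === -1);
}
```

Under this sketch, the `cc-bundled-docs-dts` output described in the table (with `<Empty>` nested inside `<List>`) would fail `emptyOutsideList`.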

`new-component` — Callout (component the model has no training data for)

A brand-new Callout component that didn't exist at training time. Pure discoverability test — can the agent find a component it has never heard of and use it correctly?

All six mechanisms discover Callout and import from @base-ui/react/callout. The differences are stylistic. Grades cluster around A−/A; cc-baseline lands first by walking .d.ts files alone, which is the most interesting result here: discovery works fine for a new component as long as it has a .d.ts.

Takeaways

- cc-bundled-docs-agents-md earns the top grade, outright or tied, on every fixture. A short workspace AGENTS.md whose only job is to point at docs/react/components/<X>.md reliably routes the agent into prose docs first, and prose docs are where the idiomatic-usage patterns live (function-child <List>, composition rules, etc.).
- Types-first channels (bundled-docs-dts, inline-dts-agents-md) overweight what types can express. They fix subpath imports cleanly via the @packageDocumentation preamble, but they push the agent toward hook-flavoured solutions (useFilteredItems, useFilter) and miss the function-child filtering pattern that only the prose docs teach.
- No pointer ≈ wasted channel. cc-bundled-docs ships the same markdown as the winner but, without an AGENTS.md, the agent sometimes wanders into esm/*.js or root package.json before finding it. The pointer is doing most of the work, not the content.
- Discovery isn't the bottleneck for new components. Even cc-baseline finds callout/index.d.ts on the first try. Where the matrix actually separates mechanisms is idiomatic API usage on familiar components with recently-changed behaviour.
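To compare mechanisms across fixtures, the letter grades from the four graded tables can be averaged. The sketch below copies those grades verbatim (written with ASCII `-`); the letter-to-point mapping is an assumption borrowed from a conventional 4.0 scale and is not part of the eval. `new-component` is left out because no per-mechanism grades are listed for it.

```typescript
// Assumed 4.0-style mapping; the eval itself only reports letter grades.
const points: Record<string, number> = {
  A: 4.0, 'A-': 3.7, 'B+': 3.3, B: 3.0, 'B-': 2.7, 'C+': 2.3, C: 2.0, D: 1.0,
};

// Grades in fixture order: combobox, new-prop, breaking-change, docs-only-composition.
const grades: Record<string, string[]> = {
  'cc-baseline': ['D', 'C', 'A-', 'B+'],
  'cc-bundled-docs': ['B-', 'C+', 'B', 'A-'],
  'cc-bundled-docs-agents-md': ['A', 'A', 'A', 'A'],
  'cc-bundled-docs-dts': ['A-', 'B-', 'C+', 'D'],
  'cc-bundled-docs-skill': ['B', 'A-', 'A-', 'A'],
  'cc-inline-dts-agents-md': ['B', 'B+', 'C', 'A'],
};

function meanPoints(letters: string[]): number {
  return letters.reduce((sum, letter) => sum + points[letter], 0) / letters.length;
}

// Mechanisms ordered from highest to lowest mean grade.
const ranking: string[] = Object.keys(grades).sort(
  (a, b) => meanPoints(grades[b]) - meanPoints(grades[a]),
);
```

Since each cell was run only once, any such average inherits the same caveat as the grades themselves: it is a preliminary signal, not a measurement.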

Reproducing

```bash
cd test/evals
pnpm report                           # pass rate / cost / duration matrix
pnpm eval cc-bundled-docs-agents-md   # rerun a single mechanism
```

Benchmarks previously ran with zero CSS, so paint, positioning, and transition state machinery (data-starting-style, etc.) were not exercised, making results diverge from real-world usage. Copy the hero demo CSS modules into the workspace and apply their class names so the benches hit the same code paths as the docs demos. Also adds a debug script that runs vitest with --ui and a visible browser.

Mechanism × scenario × model harness on `@vercel/agent-eval`. Masquerade is parked in `lib/masquerade.ts` (off by default in `defineExperiment`) pending root-cause of the post-rewrite CLI parse failure.

Replaces the vendor-tarball + file: ref fixture form with a pre-baked install. For each variant, pack.ts spins up an in-process Verdaccio with an npmjs.org uplink, pre-populates storage with the patched @base-ui/react + @base-ui/utils tarballs (skipping the publish auth dance), runs `npm install` in a staging dir, and scrubs the resulting package-lock.json URLs to registry.npmjs.org.

The populated node_modules ships as `.deps.tar` (the framework excludes node_modules from fixture upload and dereferences symlinks; a tarball is the only way to preserve .bin/* symlinks). A setup wrapper rehydrates `.deps.tar` into `node_modules/` inside each sandbox before the mechanism setup runs.

By the time the agent sees the workspace it's indistinguishable from a fresh `npm install @base-ui/react`: semver-pinned package.json, scrubbed lockfile, populated node_modules, no vendor/, no overrides, no file: refs. Smoke run confirmed zero tar/vendor/.deps/.npmrc/overrides references in any agent transcript.

bundled-docs mechanism is parked pending base-ui PR mui#4761; the experiment file is renamed to cc-bundled-docs.ts.parked. Masquerade is deleted (no longer needed).

Adds --smoke to scripts/run.ts (tags summary.json so the framework excludes the run from result-reuse) and accepts --flag=value for all flags. Adds a Docker-daemon preflight that fails fast with a clear remediation message when the chosen sandbox backend isn't reachable.

scripts/report.ts: drops legacy un-suffixed result dirs (no more stray "?" rows) and pads columns so the matrix is readable in a terminal.
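The lockfile scrubbing step can be pictured as a rewrite of the `resolved` URLs in package-lock.json. This is a minimal sketch, assuming the lockfile uses the `packages` map of lockfile v2/v3 and that the local registry is reachable at some origin such as `http://localhost:4873`; `scrubLockfile` and the types are hypothetical, and the actual logic in pack.ts may differ.

```typescript
// Only the fields this sketch touches; real lockfile entries carry more.
interface LockPackage {
  version?: string;
  resolved?: string;
  integrity?: string;
}

interface Lockfile {
  packages: Record<string, LockPackage>;
}

const PUBLIC_REGISTRY = 'https://registry.npmjs.org';

// Returns a copy of the lockfile in which every `resolved` URL that starts with
// `localOrigin` (for example a local Verdaccio instance) points at the public registry.
function scrubLockfile(lock: Lockfile, localOrigin: string): Lockfile {
  const packages: Record<string, LockPackage> = {};
  for (const key of Object.keys(lock.packages)) {
    const original = lock.packages[key];
    const copy: LockPackage = { ...original };
    if (original.resolved !== undefined && original.resolved.indexOf(localOrigin) === 0) {
      copy.resolved = PUBLIC_REGISTRY + original.resolved.slice(localOrigin.length);
    }
    packages[key] = copy;
  }
  return { ...lock, packages };
}
```

Returning a copy rather than mutating in place keeps the staged lockfile available for debugging; that is a design choice of this sketch, not something the commit describes.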

# Conflicts: # pnpm-lock.yaml

PR mui#4761 ships docs/*.md inside @base-ui/react when built with BASE_UI_PUBLISH_DOCS=1. Rather than rebuild the package twice per variant, pack.ts now produces a single shared `.docs-overlay.tar` (staged from docs/public/**/*.md, mapped to node_modules/@base-ui/react/docs/), copied into every fixture alongside `.deps.tar`. The bundled-docs mechanism extracts the overlay onto the populated node_modules; the rehydrate wrapper cleans both tarballs up after.

Default model is now sonnet only (the matrix doesn't need to spend opus budget by default; opt in with `--model opus`). `--model` is the new flag name; `--models` is kept as a legacy alias.

Adds a discoverability ablation across four pointer locations. The current `bundled-docs` arm is kept as the no-pointer control; four new arms each layer a single docs-pointer on top:

- bundled-docs-readme overwrites node_modules/@base-ui/react/README.md
- bundled-docs-dts prepends a /** @packageDocumentation */ block to node_modules/@base-ui/react/index.d.ts
- bundled-docs-agents-md writes an AGENTS.md (+ CLAUDE.md shim) that tells the agent docs live under node_modules/@base-ui/react/docs/
- bundled-docs-skill same as above as a Claude Code skill

The published AGENTS.md / SKILL.md assets and packages/react/{README.md, src/index.ts} are intentionally untouched — the pointers ship only as mechanism-time injections so each arm measures the marginal value of one specific hint.

Adds a new synthetic eval `docs-only-composition`: a fictional `Combobox.RecentSearches` part whose placement rule (must nest inside `Combobox.Empty`) lives only in the patched bundled docs. Types accept it anywhere; the answer isn't in training data and isn't in types. Existing arms should reliably misplace it; the docs+pointer arms should compose it correctly.

The pack pipeline learned per-eval docs overlays: the Patch interface gained an optional `patchDocs(stagedDocsDir)` hook, scripts/pack.ts stages the docs root as a directory at top-level and per-variant copies+patches+tars when the patch opts in. Existing four fixtures keep using the shared overlay tarball.

For new-prop, breaking-change, and new-component the patch mutated the installed package types but left the bundled docs stale. Doc-pointer arms (bundled-docs-readme/dts/agents-md/skill) then read docs that contradicted the patched API, conflating "agent finds the change via docs" with "agent recovers from misleading docs." Add a `patchDocs(stagedDocsDir)` hook to each: new-prop inserts a closeOnClear row + behavior section into combobox.md, breaking-change renames API tokens Clear -> Reset in combobox.md (CSS classes and aria-labels left intact), and new-component writes a synthetic callout.md. Also tidy prettier whitespace across previously-committed evals files.
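The breaking-change docs patch has to rename API tokens without touching CSS class names or aria-labels that also contain the word "Clear". One way to get that behaviour is to match only the qualified `Combobox.Clear` token. The function below is a hypothetical sketch of such a transform on the markdown text; the real `patchDocs` hook receives a staged docs directory and may work differently.

```typescript
// Renames the API token Combobox.Clear to Combobox.Reset in a markdown string.
// Only the qualified token is matched, so `styles.Clear`, `.Clear` CSS selectors,
// and aria-label text such as "Clear selection" are left untouched.
function renameClearToReset(markdown: string): string {
  return markdown.replace(/\bCombobox\.Clear\b/g, 'Combobox.Reset');
}

// Hypothetical usage on a one-line docs snippet.
const before = '<Combobox.Clear className={styles.Clear} aria-label="Clear selection" />';
const after = renameClearToReset(before);
```

A plain search-and-replace on "Clear" would also rewrite the class name and the label, which is the kind of misleading docs drift this commit sets out to avoid.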

Sort matrix rows by mean pass rate descending so the strong arms cluster at the top. Add an "Age" column showing relative time since each row's latest run, and dim rows whose latest run is more than 24h older than the newest — old archive data shouldn't be compared apples-to-apples with fresh ablation runs. Colour pass-rate cells green/yellow/red on TTY; non-TTY output stays as plain markdown.
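The ordering and staleness rules described above fit in a few lines. This is a sketch only: the `Row` shape and `arrangeRows` are hypothetical names, and the TTY colouring and "Age" formatting of scripts/report.ts are left out.

```typescript
interface Row {
  mechanism: string;
  passRates: number[]; // one pass rate per fixture, each in [0, 1]
  latestRun: number;   // epoch milliseconds of this row's latest run
}

interface ArrangedRow extends Row {
  meanPassRate: number;
  stale: boolean; // true when the row should be dimmed
}

const DAY_MS = 24 * 60 * 60 * 1000;

function mean(values: number[]): number {
  return values.length === 0 ? 0 : values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Sorts rows by mean pass rate, descending, and flags rows whose latest run is
// more than 24h older than the newest run in the matrix.
function arrangeRows(rows: Row[]): ArrangedRow[] {
  const newest = rows.reduce((max, row) => Math.max(max, row.latestRun), -Infinity);
  return rows
    .map((row) => ({
      ...row,
      meanPassRate: mean(row.passRates),
      stale: newest - row.latestRun > DAY_MS,
    }))
    .sort((a, b) => b.meanPassRate - a.meanPassRate);
}
```

Staleness is measured against the newest row rather than the current time, matching the description: an archive that is uniformly old is still internally comparable.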

A new mechanism that ships docs as JSDoc preambles on the .d.ts files themselves rather than as a separate markdown overlay. Agents that already read types land on the prose without needing to discover a docs/ directory. Paired with a Next.js-style condensed AGENTS.md (now also used for the existing bundled-docs-agents-md arm) telling the agent the types are the source of truth.

Pack pipeline adds a per-eval .inline-dts-overlay.tar containing a copy of @base-ui/react with: a common anatomy JSDoc preamble prepended to combobox/index.parts.d.ts, plus patch-specific JSDoc bumps via a new optional patchInlineDts hook on the Patch interface. breaking-change renames the Clear → Reset reference in the anatomy preamble; docs-only-composition lands the RecentSearches placement rule on both the part's .d.ts and the anatomy preamble.

Also skip incomplete fixtures (no tsconfig.json) in the pack loop so stray scratch dirs don't break the bake.
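Prepending an anatomy preamble to a .d.ts file amounts to wrapping prose in a JSDoc comment and putting it at the top of the file. The sketch below shows one way to do that on strings; `prependPreamble` is a hypothetical name, the idempotence check is a convenience of this sketch, and it assumes the prose does not contain the `*/` sequence.

```typescript
// Wraps `prose` in a JSDoc block and prepends it to the declaration source.
// Calling it again with the same prose leaves the file unchanged.
// Assumes `prose` does not contain the comment terminator "*/".
function prependPreamble(dts: string, prose: string): string {
  const body = prose
    .split('\n')
    .map((line) => (' * ' + line).replace(/\s+$/, '')) // blank lines become " *"
    .join('\n');
  const block = '/**\n' + body + '\n */\n';
  return dts.indexOf(block) === 0 ? dts : block + dts;
}
```

Keeping the prose next to the types means an agent that opens the .d.ts file sees the guidance before any declarations, with no separate docs directory to discover.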

- Drop mechanisms we won't recommend: agents-md, bundled-docs-readme, mcp, skill. Removes assets + experiment files; mechanism union trimmed.
- Restructure inline-dts-agents-md: workspace AGENTS.md becomes a pointer; the real guidance (subpath imports, "components are top-level dirs") moves into a package-local AGENTS.md shipped under node_modules/@base-ui/react/, the directory the agent already walks while reading .d.ts files.
- Move EVAL.ts to a single canonical evals/_EVAL.ts. Each fixture's EVAL.ts is a byte-identical copy, kept in sync by scripts/sync-evals.ts (--check mode for CI). Per-fixture and shared idiomatic tests gate on test.skipIf.
- Add three shared idiomatic graders for the combobox family: function-child <Combobox.List>, items prop on Root, no Combobox.useFilteredItems. Catches "passes the old graders but isn't idiomatic" runs.
- Drop the strict subpath-import grader from all EVAL.ts files; npm run build already catches invalid paths and the preference is user-level, not a correctness check.
- Strip leaky "use Base UI's combobox/callout" hints from new-component and new-prop PROMPT.md.
- Add scripts/first-reads.ts: per-cell matrix showing the first file the agent reads under @base-ui/react. Surfaces which content channel each mechanism actually steers the agent into.
- Prefix Mechanism column in report with cc- so the value is copy-pasteable straight into pnpm eval cc-<name>.

pkg-pr-new · 2026-06-09T11:01:12Z

pnpm add https://pkg.pr.new/mui/base-ui/@base-ui/react@4989

pnpm add https://pkg.pr.new/mui/base-ui/@base-ui/utils@4989

commit: e65359d

code-infra-dashboard · 2026-06-09T11:02:44Z

Bundle size

| Bundle | Parsed size | Gzip size |
| --- | --- | --- |
| @base-ui/react | 0 B (0.00%) | 0 B (0.00%) |

Performance

Total duration: 902.47 ms (▼ -288.90 ms, -24.2%) | Renders: 51 (🔺 +1) | Paint: 1,345.85 ms (▼ -476.48 ms, -26.1%)

| Test | Duration | Renders |
| --- | --- | --- |
| Select open (500 options) | 68.87 ms (🔺 +26.38 ms, +62.1%) | 15 (🔺 +1) |
| ~~Tabs mount (200 instances)~~ (removed) | — | — |
| ~~Slider mount (300 instances)~~ (removed) | — | — |
| ~~Scroll Area mount (300 instances)~~ (removed) | — | — |
| ~~Popover mount (300 instances)~~ (removed) | — | — |

…and 4 more (+7 within noise) — details

Check out the code infra dashboard for more information about this PR.

netlify · 2026-06-09T11:03:41Z

✅ Deploy Preview for base-ui ready!

| Name | Link |
| --- | --- |
| 🔨 Latest commit | `e65359d` |
| 🔍 Latest deploy log | https://app.netlify.com/projects/base-ui/deploys/6a27f21e3856110008848657 |
| 😎 Deploy Preview | https://deploy-preview-4989--base-ui.netlify.app |

Janpot added 17 commits May 8, 2026 12:15

[code-infra] Add performance benchmark

7a3c011

remove

6b42246

reduce counts

cd4b084

Empty commit

329b156

Empty commit

c1ddc77

fix name

c81f928

Update package.json

7655923

[evals] Checkpoint working agent-knowledge eval matrix

9491c23

Merge remote-tracking branch 'upstream/master' into evals/checkpoint

c7627e7

Janpot changed the title ~~[evals] Tighten matrix and share EVAL.ts across fixtures~~ [evals] Grade knowledge-injection mechanisms on output quality Jun 9, 2026

github-actions Bot added the PR: out-of-date The pull request has merge conflicts and can't be merged. label Jun 9, 2026

Janpot wants to merge 17 commits into mui:master from Janpot:evals/checkpoint
