[evals] Grade knowledge-injection mechanisms on output quality#4989

Draft
Janpot wants to merge 17 commits into mui:master from Janpot:evals/checkpoint

Conversation

@Janpot Janpot commented Jun 9, 2026

Member

Checkpoint of the agent-knowledge eval matrix at test/evals/. Six knowledge-injection mechanisms × five fixtures × Claude sonnet, each producing an App.tsx that gets graded both deterministically and qualitatively.

Note

  • The matrix was run once per cell to keep cost and runtime down. For conclusive results we'd want N runs per cell plus an LLM-as-judge grading pass; treat the grades below as preliminary. They do match prior expectations though — a workspace AGENTS.md pointing at prose docs is the strongest channel.
  • Not intended to merge. This branch is an exploration; the results inform what we recommend, but the harness and fixtures stay out of master.

How to read the eval

Each fixture asks the agent to build a small Base UI app under conditions the model was not trained on (a renamed part, a new prop, a brand-new component, prose-only composition). The "mechanism" controls what knowledge channel the sandbox exposes:

  • cc-baseline — nothing extra; the model relies on its training and on reading .d.ts files in node_modules.
  • cc-bundled-docs — docs/react/components/*.md shipped inside the package; no pointer.
  • cc-bundled-docs-agents-md — bundled docs + a workspace AGENTS.md pointing at them.
  • cc-bundled-docs-dts — bundled docs + a @packageDocumentation preamble on index.d.ts.
  • cc-bundled-docs-skill — bundled docs + a SKILL.md advertised in the workspace.
  • cc-inline-dts-agents-md — anatomy JSDoc inlined into the .d.ts files + a workspace AGENTS.md pointing at a package-local AGENTS.md shipped alongside the types.
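
The matrix described above can be sketched as a cross product of mechanisms and fixtures. This is only an illustration: the mechanism and fixture names come from this description, while the `Cell` type and the `buildMatrix` helper are hypothetical and not part of the harness.

```typescript
// Names taken from the PR description; types and helper are hypothetical.
const MECHANISMS = [
  'cc-baseline',
  'cc-bundled-docs',
  'cc-bundled-docs-agents-md',
  'cc-bundled-docs-dts',
  'cc-bundled-docs-skill',
  'cc-inline-dts-agents-md',
] as const;

const FIXTURES = [
  'combobox',
  'new-prop',
  'breaking-change',
  'docs-only-composition',
  'new-component',
] as const;

type Mechanism = (typeof MECHANISMS)[number];
type Fixture = (typeof FIXTURES)[number];

interface Cell {
  mechanism: Mechanism;
  fixture: Fixture;
  model: string;
}

// Enumerate every mechanism × fixture pair for a single model.
function buildMatrix(model: string): Cell[] {
  const cells: Cell[] = [];
  for (const mechanism of MECHANISMS) {
    for (const fixture of FIXTURES) {
      cells.push({ mechanism, fixture, model });
    }
  }
  return cells;
}
```

With one model this yields 6 × 5 = 30 cells, each of which was run once for the grades below.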

LLM grading of App.tsx outputs

Letter grades reflect how close the output is to idiomatic 1.5 Combobox usage: subpath import, items prop on Root, function-child <List> for filtering, <Portal> layering, correct Empty/List/RecentSearches composition, and the per-fixture API the prompt expects.
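
As a rough illustration of the criteria above, some of them can be approximated with source-level heuristics. The sketch below is not the harness's grader: `checkIdioms` and its regexes are hypothetical, and they assume JSX attribute values contain no `<` or `>` characters.

```typescript
interface IdiomCheck {
  label: string;
  pass: boolean;
}

// Regex heuristics over the text of App.tsx. Hypothetical helper; a real
// grader would more likely inspect the AST.
function checkIdioms(source: string): IdiomCheck[] {
  return [
    {
      label: 'subpath import',
      pass: /from\s+['"]@base-ui\/react\/combobox['"]/.test(source),
    },
    {
      // Allows an optional JSX generic such as <Combobox.Root<Fruit> ...>.
      label: 'items prop on Root',
      pass: /<Combobox\.Root\b(?:<[^<>]*>)?[^<>]*\bitems=/.test(source),
    },
    {
      // The first child of <Combobox.List> starts with "{(" (a render function).
      label: 'function-child <List>',
      pass: /<Combobox\.List\b[^>]*>\s*\{\s*\(/.test(source),
    },
    {
      label: '<Portal> layering',
      pass: /<Combobox\.Portal\b/.test(source),
    },
    {
      label: 'no useFilteredItems',
      pass: !/\buseFilteredItems\b/.test(source),
    },
  ];
}
```

A check like this only flags the presence or absence of a pattern; the letter grades in the tables below also weigh things it cannot see, such as overbuilt filters.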

combobox — basic fruit picker

Sanity check. A vanilla fruit picker using only APIs that exist in the model's training data — measures baseline fluency with no post-cutoff surprises.

| Mechanism | Grade | Notes |
| --- | --- | --- |
| cc-baseline | D | Root import; hand-rolls a FruitList wrapper around Combobox.useFilteredItems (internal hook) and .maps items manually. Wrong API on every axis. |
| cc-bundled-docs | B− | Function-child <List> ✓, generic Root<Fruit> ✓, but imports from @base-ui/react (no subpath). No <Portal>. |
| cc-bundled-docs-agents-md | A | Subpath import, items prop, function-child, <Portal>, <ItemIndicator>, Empty before List. Reads like a hand-tuned demo. |
| cc-bundled-docs-dts | A− | Same shape as above; skips Label and self-closes Clear/Trigger with aria-label. |
| cc-bundled-docs-skill | B | Subpath ✓, items ✓, <Portal> ✓ — but .map inside <List> instead of function-child. |
| cc-inline-dts-agents-md | B | Same trade-off as skill: structure is right, but it .maps items rather than using the function-child filtering API. |

new-prop — <Combobox.Clear closeOnClear>

Tests discovery of a synthetic post-cutoff closeOnClear prop on <Combobox.Clear>. The model can't have seen this prop; it has to find it in the injected channel.

| Mechanism | Grade | Notes |
| --- | --- | --- |
| cc-baseline | C | Root import; .map inside <List>; missing <Portal>. Picks up closeOnClear from the .d.ts files (it cannot come from training, since the prop is synthetic). |
| cc-bundled-docs | C+ | Wrong import; function-child ✓; over-engineers a custom filter={…} that duplicates what the items API already does. |
| cc-bundled-docs-agents-md | A | Clean: subpath, function-child, <Portal>, closeOnClear, <ItemIndicator>. |
| cc-bundled-docs-dts | B− | Function-child ✓, closeOnClear ✓, but imports from root. |
| cc-bundled-docs-skill | A− | All the right pieces; self-closing Clear is debatable but valid. |
| cc-inline-dts-agents-md | B+ | Subpath ✓, closeOnClear ✓, <Portal> ✓; still .maps instead of function-child. |

breaking-change — Combobox.Clear renamed to Combobox.Reset

A synthetic post-cutoff rename: <Combobox.Clear> no longer exists, replaced by <Combobox.Reset>. Tests whether the agent notices the part it remembers from training is gone.

| Mechanism | Grade | Notes |
| --- | --- | --- |
| cc-baseline | A− | Picks up Reset and function-child, but threads index through manually. |
| cc-bundled-docs | B | Wrong import; otherwise correct. |
| cc-bundled-docs-agents-md | A | Gold standard. |
| cc-bundled-docs-dts | C+ | Reaches for Combobox.useFilter({sensitivity: 'base'}) and passes filter={filter.contains}; clever but overbuilt, and .maps the list. Reads the types and decides to maximize them. |
| cc-bundled-docs-skill | A− | Right shape; self-closing Reset/Trigger with aria-label. |
| cc-inline-dts-agents-md | C | Omits the items prop (relies on a Root<Fruit> generic alone) and .maps the static array — the filtering API is bypassed entirely. |

docs-only-composition — <Combobox.RecentSearches> nested in <Combobox.Empty>

A synthetic post-cutoff <Combobox.RecentSearches> part whose composition rule (must nest inside <Combobox.Empty>) is documented in prose only. Not derivable from the .d.ts shape — measures whether the mechanism surfaces docs the types can't.
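
The composition rule above (RecentSearches must sit inside Empty) can be checked with a small tag-stack scan over the JSX source. This is a minimal sketch under stated assumptions, not the grader used here: `isNestedInside` is a hypothetical helper, it only tracks capitalized component tags, and it assumes opening tags contain no `<` or `>` inside attribute values.

```typescript
// Returns true when `child` appears at least once and every occurrence has
// `parent` somewhere among its open ancestors. Heuristic scanner, not a parser.
function isNestedInside(source: string, child: string, parent: string): boolean {
  const tagPattern = /<(\/?)([A-Z][\w.]*)\b([^<>]*?)(\/?)>/g;
  const openTags: string[] = [];
  let sawChild = false;
  let match: RegExpExecArray | null;
  while ((match = tagPattern.exec(source)) !== null) {
    const isClosing = match[1] === '/';
    const tag = match[2];
    const isSelfClosing = match[4] === '/';
    if (isClosing) {
      // Pop back to the matching opener; ignore stray closers.
      const index = openTags.lastIndexOf(tag);
      if (index !== -1) openTags.length = index;
      continue;
    }
    if (tag === child) {
      sawChild = true;
      if (openTags.indexOf(parent) === -1) return false;
    }
    if (!isSelfClosing) openTags.push(tag);
  }
  return sawChild;
}
```

For example, `isNestedInside(appSource, 'Combobox.RecentSearches', 'Combobox.Empty')` would reject an output that places RecentSearches directly under `<Combobox.List>`.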

| Mechanism | Grade | Notes |
| --- | --- | --- |
| cc-baseline | B+ | Wrong import, but the composition (RecentSearches inside Empty, Empty outside List) is correct. |
| cc-bundled-docs | A− | Subpath ✓, function-child ✓, composition ✓; minor: no <Portal>. |
| cc-bundled-docs-agents-md | A | All correct. |
| cc-bundled-docs-dts | D | Goes back to Combobox.useFilteredItems with a FruitItems wrapper and nests <Empty> inside <List> — wrong layout. The .d.ts preamble doesn't talk about composition, so the agent invents structure. |
| cc-bundled-docs-skill | A | Clean composition with <Portal>. |
| cc-inline-dts-agents-md | A | Clean composition with <Portal>. |

new-component — Callout (component the model has no training data for)

A brand-new Callout component that didn't exist at training time. Pure discoverability test — can the agent find a component it has never heard of and use it correctly?

All six mechanisms discover Callout and import from @base-ui/react/callout. The differences are stylistic. Grades cluster around A−/A; cc-baseline gets there by walking .d.ts files alone, which is the most interesting result here: discovery works fine for a new component as long as it has a .d.ts.

Takeaways

  • cc-bundled-docs-agents-md takes or ties for the top grade on every fixture. A short workspace AGENTS.md whose only job is to point at docs/react/components/<X>.md reliably routes the agent into prose docs first, and prose docs are where the idiomatic-usage patterns live (function-child <List>, composition rules, etc.).
  • Types-first channels (bundled-docs-dts, inline-dts-agents-md) overweight what types can express. They fix subpath imports cleanly via the @packageDocumentation preamble, but they push the agent toward hook-flavoured solutions (useFilteredItems, useFilter) and miss the function-child filtering pattern that only the prose docs teach.
  • No pointer ≈ wasted channel. cc-bundled-docs ships the same markdown as the winner but, without an AGENTS.md, the agent sometimes wanders into esm/*.js or root package.json before finding it. The pointer is doing most of the work, not the content.
  • Discovery isn't the bottleneck for new components. Even cc-baseline finds callout/index.d.ts on the first try. Where the matrix actually separates mechanisms is idiomatic API usage on familiar components with recently-changed behaviour.
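
The observations above about where an agent looks first can be made concrete by extracting the first file it reads under the package. The sketch below assumes the transcript has already been reduced to an ordered list of read paths; that shape, and the `firstPackageRead` helper, are hypothetical.

```typescript
const PACKAGE_ROOT = 'node_modules/@base-ui/react/';

// Returns the package-relative path of the first read under @base-ui/react,
// or null when the agent never opened a file inside the package.
function firstPackageRead(readPaths: string[]): string | null {
  for (const readPath of readPaths) {
    const index = readPath.indexOf(PACKAGE_ROOT);
    if (index !== -1) {
      return readPath.slice(index + PACKAGE_ROOT.length);
    }
  }
  return null;
}
```

A first read of `docs/react/components/combobox.md` versus `package.json` is the kind of difference that separates a pointer-driven mechanism from the no-pointer control.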

Reproducing

cd test/evals
pnpm report                           # pass rate / cost / duration matrix
pnpm eval cc-bundled-docs-agents-md   # rerun a single mechanism

Janpot added 17 commits May 8, 2026 12:15
Benchmarks previously ran with zero CSS, so paint, positioning, and
transition state machinery (data-starting-style, etc.) were not
exercised, making results diverge from real-world usage. Copy the hero
demo CSS modules into the workspace and apply their class names so the
benches hit the same code paths as the docs demos. Also adds a debug
script that runs vitest with --ui and a visible browser.
Mechanism × scenario × model harness on `@vercel/agent-eval`. Masquerade
is parked in `lib/masquerade.ts` (off by default in `defineExperiment`)
pending root-cause of the post-rewrite CLI parse failure.
Replaces the vendor-tarball + file: ref fixture form with a pre-baked
install. For each variant, pack.ts spins up an in-process Verdaccio with
an npmjs.org uplink, pre-populates storage with the patched
@base-ui/react + @base-ui/utils tarballs (skipping the publish auth
dance), runs `npm install` in a staging dir, and scrubs the resulting
package-lock.json URLs to registry.npmjs.org. The populated node_modules
ships as `.deps.tar` (the framework excludes node_modules from fixture
upload and dereferences symlinks; a tarball is the only way to preserve
.bin/* symlinks). A setup wrapper rehydrates `.deps.tar` into
`node_modules/` inside each sandbox before the mechanism setup runs.
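
The lockfile scrub described above can be sketched as a pure function over the parsed package-lock.json. This is an illustration only: `scrubLockfile` and its types are hypothetical, the sketch only rewrites the `resolved` field of entries under `packages`, and the local registry URL in the usage note is an assumed Verdaccio address.

```typescript
interface LockEntry {
  resolved?: string;
  [field: string]: unknown;
}

interface Lockfile {
  packages: Record<string, LockEntry>;
}

const PUBLIC_REGISTRY = 'https://registry.npmjs.org';

// Rewrites tarball URLs that point at the local registry so they point at the
// public registry instead. Returns a new object; the input is left untouched.
function scrubLockfile(lock: Lockfile, localRegistry: string): Lockfile {
  const localPrefix = localRegistry.replace(/\/+$/, '') + '/';
  const packages: Record<string, LockEntry> = {};
  for (const key of Object.keys(lock.packages)) {
    const entry = lock.packages[key];
    const resolved = entry.resolved;
    packages[key] =
      typeof resolved === 'string' && resolved.indexOf(localPrefix) === 0
        ? { ...entry, resolved: PUBLIC_REGISTRY + '/' + resolved.slice(localPrefix.length) }
        : entry;
  }
  return { ...lock, packages };
}
```

Calling `scrubLockfile(lock, 'http://localhost:4873/')` would leave entries that were fetched through the uplink from other hosts unchanged.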

By the time the agent sees the workspace it's indistinguishable from a
fresh `npm install @base-ui/react`: semver-pinned package.json, scrubbed
lockfile, populated node_modules, no vendor/, no overrides, no file:
refs. Smoke run confirmed zero tar/vendor/.deps/.npmrc/overrides
references in any agent transcript.

bundled-docs mechanism is parked pending base-ui PR mui#4761; the
experiment file is renamed to cc-bundled-docs.ts.parked. Masquerade is
deleted (no longer needed).

Adds --smoke to scripts/run.ts (tags summary.json so the framework
excludes the run from result-reuse) and accepts --flag=value for all
flags. Adds a Docker-daemon preflight that fails fast with a clear
remediation message when the chosen sandbox backend isn't reachable.
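
Accepting both `--flag value` and `--flag=value` can be sketched with a small argv scanner. This is a hypothetical `parseFlags`, not the code in scripts/run.ts; in particular, the list of boolean flags is an assumption needed so that `--smoke` never swallows the argument that follows it.

```typescript
type Flags = Record<string, string | boolean>;

// Parses "--name=value", "--name value", and bare boolean flags.
// Positional arguments are ignored in this sketch.
function parseFlags(argv: string[], booleanFlags: string[] = ['smoke']): Flags {
  const flags: Flags = {};
  for (let i = 0; i < argv.length; i++) {
    const arg = argv[i];
    if (arg.indexOf('--') !== 0) continue;
    const eq = arg.indexOf('=');
    if (eq !== -1) {
      flags[arg.slice(2, eq)] = arg.slice(eq + 1);
      continue;
    }
    const flagName = arg.slice(2);
    const next = argv[i + 1];
    const takesValue =
      booleanFlags.indexOf(flagName) === -1 && next !== undefined && next.indexOf('--') !== 0;
    if (takesValue) {
      flags[flagName] = next;
      i++; // consume the value
    } else {
      flags[flagName] = true;
    }
  }
  return flags;
}
```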

scripts/report.ts: drops legacy un-suffixed result dirs (no more stray
"?" rows) and pads columns so the matrix is readable in a terminal.
PR mui#4761 ships docs/*.md inside @base-ui/react when built with
BASE_UI_PUBLISH_DOCS=1. Rather than rebuild the package twice per
variant, pack.ts now produces a single shared `.docs-overlay.tar`
(staged from docs/public/**/*.md, mapped to node_modules/@base-ui/react/
docs/), copied into every fixture alongside `.deps.tar`. The
bundled-docs mechanism extracts the overlay onto the populated
node_modules; the rehydrate wrapper cleans both tarballs up after.

Default model is now sonnet only (the matrix doesn't need to spend opus
budget by default; opt in with `--model opus`). `--model` is the new
flag name; `--models` is kept as a legacy alias.
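
The flag rename above amounts to a small fallback chain. The sketch below is hypothetical: `resolveModel` is not from the harness, and letting `--model` win when both flags are passed is an assumption.

```typescript
interface ModelFlags {
  model?: string;
  models?: string; // legacy alias
}

const DEFAULT_MODEL = 'sonnet';

// Prefers --model, falls back to the legacy --models, then to the default.
function resolveModel(flags: ModelFlags): string {
  return flags.model || flags.models || DEFAULT_MODEL;
}
```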
Adds a discoverability ablation across four pointer locations. The
current `bundled-docs` arm is kept as the no-pointer control; four new
arms each layer a single docs-pointer on top:

  - bundled-docs-readme       overwrites node_modules/@base-ui/react/README.md
  - bundled-docs-dts          prepends a /** @packageDocumentation */
                              block to node_modules/@base-ui/react/index.d.ts
  - bundled-docs-agents-md    writes an AGENTS.md (+ CLAUDE.md shim) that
                              tells the agent docs live under
                              node_modules/@base-ui/react/docs/
  - bundled-docs-skill        same as above as a Claude Code skill

The published AGENTS.md / SKILL.md assets and packages/react/{README.md,
src/index.ts} are intentionally untouched — the pointers ship only as
mechanism-time injections so each arm measures the marginal value of
one specific hint.
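
As an illustration of the bundled-docs-dts arm, prepending a `@packageDocumentation` block to a declaration file can be done idempotently on the file contents. `prependPackageDocs` is a hypothetical helper, and the pointer wording in the usage note is made up for the example.

```typescript
// Prepends a JSDoc block tagged @packageDocumentation to the .d.ts source.
// Returns the input unchanged if such a block is already present.
function prependPackageDocs(dts: string, pointer: string): string {
  if (dts.indexOf('@packageDocumentation') !== -1) return dts;
  const lines = pointer.split('\n').map((line) => ' * ' + line);
  const block = ['/**', ...lines, ' *', ' * @packageDocumentation', ' */'].join('\n');
  return block + '\n' + dts;
}
```

For instance, `prependPackageDocs(source, 'Component docs live under ./docs/')` puts the hint at the top of the file an agent already reads when it inspects the types.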

Adds a new synthetic eval `docs-only-composition`: a fictional
`Combobox.RecentSearches` part whose placement rule (must nest inside
`Combobox.Empty`) lives only in the patched bundled docs. Types accept
it anywhere; the answer isn't in training data and isn't in types.
Existing arms should reliably misplace it; the docs+pointer arms
should compose it correctly.

The pack pipeline learned per-eval docs overlays: the Patch interface
gained an optional `patchDocs(stagedDocsDir)` hook, scripts/pack.ts
stages the docs root as a directory at top-level and per-variant
copies+patches+tars when the patch opts in. Existing four fixtures
keep using the shared overlay tarball.
For new-prop, breaking-change, and new-component the patch mutated the
installed package types but left the bundled docs stale. Doc-pointer
arms (bundled-docs-readme/dts/agents-md/skill) then read docs that
contradicted the patched API, conflating "agent finds the change via
docs" with "agent recovers from misleading docs."

Add a `patchDocs(stagedDocsDir)` hook to each: new-prop inserts a
closeOnClear row + behavior section into combobox.md, breaking-change
renames API tokens Clear -> Reset in combobox.md (CSS classes and
aria-labels left intact), and new-component writes a synthetic
callout.md. Also tidy prettier whitespace across previously-committed
evals files.
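
The breaking-change docs patch described above renames API tokens while leaving CSS classes and aria-labels alone. A minimal sketch of that idea, assuming the API is always referenced in its dotted `Combobox.Clear` form (the real patch may handle more cases; `renameClearToReset` is a hypothetical name):

```typescript
// Renames only the dotted API token. Class names such as styles.Clear and
// label text such as "Clear selection" do not match and stay as they are.
function renameClearToReset(markdown: string): string {
  return markdown.replace(/\bCombobox\.Clear\b/g, 'Combobox.Reset');
}
```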
Sort matrix rows by mean pass rate descending so the strong arms cluster
at the top. Add an "Age" column showing relative time since each row's
latest run, and dim rows whose latest run is more than 24h older than
the newest — old archive data shouldn't be compared apples-to-apples
with fresh ablation runs. Colour pass-rate cells green/yellow/red on
TTY; non-TTY output stays as plain markdown.
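
The report ordering above can be sketched as follows. The `Row` shape and `arrangeRows` are hypothetical; only the rules (sort by mean pass rate descending, dim rows more than 24h older than the newest run) come from the commit message.

```typescript
interface Row {
  mechanism: string;
  passRates: number[]; // one entry per fixture, in the range 0..1
  latestRunMs: number; // timestamp of the row's latest run
}

const DAY_MS = 24 * 60 * 60 * 1000;

function mean(values: number[]): number {
  if (values.length === 0) return 0;
  let sum = 0;
  for (const value of values) sum += value;
  return sum / values.length;
}

// Adds a mean pass rate and a dimmed flag, then sorts strongest first.
function arrangeRows(rows: Row[]): Array<Row & { meanPassRate: number; dimmed: boolean }> {
  let newest = -Infinity;
  for (const row of rows) newest = Math.max(newest, row.latestRunMs);
  return rows
    .map((row) => ({
      ...row,
      meanPassRate: mean(row.passRates),
      dimmed: newest - row.latestRunMs > DAY_MS,
    }))
    .sort((a, b) => b.meanPassRate - a.meanPassRate);
}
```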
A new mechanism that ships docs as JSDoc preambles on the .d.ts files
themselves rather than as a separate markdown overlay. Agents that
already read types land on the prose without needing to discover a
docs/ directory. Paired with a Next.js-style condensed AGENTS.md (now
also used for the existing bundled-docs-agents-md arm) telling the
agent the types are the source of truth.

Pack pipeline adds a per-eval .inline-dts-overlay.tar containing a
copy of @base-ui/react with: a common anatomy JSDoc preamble prepended
to combobox/index.parts.d.ts, plus patch-specific JSDoc bumps via a
new optional patchInlineDts hook on the Patch interface. breaking-
change renames the Clear → Reset reference in the anatomy preamble;
docs-only-composition lands the RecentSearches placement rule on both
the part's .d.ts and the anatomy preamble.

Also skip incomplete fixtures (no tsconfig.json) in the pack loop so
stray scratch dirs don't break the bake.
- Drop mechanisms we won't recommend: agents-md, bundled-docs-readme,
  mcp, skill. Removes assets + experiment files; mechanism union
  trimmed.
- Restructure inline-dts-agents-md: workspace AGENTS.md becomes a
  pointer; the real guidance (subpath imports, "components are
  top-level dirs") moves into a package-local AGENTS.md shipped
  under node_modules/@base-ui/react/, the directory the agent
  already walks while reading .d.ts files.
- Move EVAL.ts to a single canonical evals/_EVAL.ts. Each fixture's
  EVAL.ts is a byte-identical copy, kept in sync by
  scripts/sync-evals.ts (--check mode for CI). Per-fixture and
  shared idiomatic tests gate on test.skipIf.
- Add three shared idiomatic graders for the combobox family:
  function-child <Combobox.List>, items prop on Root, no
  Combobox.useFilteredItems. Catches "passes the old graders but
  isn't idiomatic" runs.
- Drop the strict subpath-import grader from all EVAL.ts files;
  npm run build already catches invalid paths and the preference
  is user-level, not a correctness check.
- Strip leaky "use Base UI's combobox/callout" hints from
  new-component and new-prop PROMPT.md.
- Add scripts/first-reads.ts: per-cell matrix showing the first
  file the agent reads under @base-ui/react. Surfaces which
  content channel each mechanism actually steers the agent into.
- Prefix Mechanism column in report with cc- so the value is
  copy-pasteable straight into pnpm eval cc-<name>.
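
The `--check` mode mentioned above for scripts/sync-evals.ts boils down to comparing each fixture's copy against the canonical file. A minimal sketch, assuming the file contents have already been read into strings; `findDriftedFixtures` is a hypothetical helper, and string equality stands in for the byte comparison.

```typescript
// Returns the fixtures whose EVAL.ts differs from the canonical _EVAL.ts,
// sorted by name. An empty result means every copy is in sync.
function findDriftedFixtures(canonical: string, copies: Record<string, string>): string[] {
  return Object.keys(copies)
    .filter((fixture) => copies[fixture] !== canonical)
    .sort();
}
```

A CI check could fail whenever the returned list is non-empty and print the fixture names.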

pkg-pr-new Bot commented Jun 9, 2026


commit: e65359d

code-infra-dashboard Bot commented Jun 9, 2026


Bundle size

| Bundle | Parsed size | Gzip size |
| --- | --- | --- |
| @base-ui/react | 0B (0.00%) | 0B (0.00%) |

Details of bundle changes

Performance

Total duration: 902.47 ms ▼-288.90 ms(-24.2%) | Renders: 51 (🔺+1) | Paint: 1,345.85 ms ▼-476.48 ms(-26.1%)

| Test | Duration | Renders |
| --- | --- | --- |
| Select open (500 options) | 68.87 ms 🔺+26.38 ms (+62.1%) | 15 (🔺+1) |
| Tabs mount (200 instances) | (removed) | |
| Slider mount (300 instances) | (removed) | |
| Scroll Area mount (300 instances) | (removed) | |
| Popover mount (300 instances) | (removed) | |

…and 4 more (+7 within noise) — details


Check out the code infra dashboard for more information about this PR.

netlify Bot commented Jun 9, 2026


Deploy Preview for base-ui ready!

| Name | Link |
| --- | --- |
| 🔨 Latest commit | e65359d |
| 🔍 Latest deploy log | https://app.netlify.com/projects/base-ui/deploys/6a27f21e3856110008848657 |
| 😎 Deploy Preview | https://deploy-preview-4989--base-ui.netlify.app |

Janpot changed the title from "[evals] Tighten matrix and share EVAL.ts across fixtures" to "[evals] Grade knowledge-injection mechanisms on output quality" Jun 9, 2026
github-actions Bot added the "PR: out-of-date" label (the pull request has merge conflicts and can't be merged) Jun 9, 2026

Labels

PR: out-of-date The pull request has merge conflicts and can't be merged.
