[evals] Grade knowledge-injection mechanisms on output quality #4989
Draft
Janpot wants to merge 17 commits into
Benchmarks previously ran with zero CSS, so paint, positioning, and transition state machinery (data-starting-style, etc.) were not exercised, making results diverge from real-world usage. Copy the hero demo CSS modules into the workspace and apply their class names so the benches hit the same code paths as the docs demos. Also adds a debug script that runs vitest with --ui and a visible browser.
Mechanism × scenario × model harness on `@vercel/agent-eval`. Masquerade is parked in `lib/masquerade.ts` (off by default in `defineExperiment`) pending root-cause of the post-rewrite CLI parse failure.
Replaces the vendor-tarball + file: ref fixture form with a pre-baked install. For each variant, pack.ts spins up an in-process Verdaccio with an npmjs.org uplink, pre-populates storage with the patched @base-ui/react + @base-ui/utils tarballs (skipping the publish auth dance), runs `npm install` in a staging dir, and scrubs the resulting package-lock.json URLs to registry.npmjs.org. The populated node_modules ships as `.deps.tar` (the framework excludes node_modules from fixture upload and dereferences symlinks; a tarball is the only way to preserve .bin/* symlinks). A setup wrapper rehydrates `.deps.tar` into `node_modules/` inside each sandbox before the mechanism setup runs.

By the time the agent sees the workspace it's indistinguishable from a fresh `npm install @base-ui/react`: semver-pinned package.json, scrubbed lockfile, populated node_modules, no vendor/, no overrides, no file: refs. Smoke run confirmed zero tar/vendor/.deps/.npmrc/overrides references in any agent transcript.

bundled-docs mechanism is parked pending base-ui PR mui#4761; the experiment file is renamed to cc-bundled-docs.ts.parked. Masquerade is deleted (no longer needed).

Adds --smoke to scripts/run.ts (tags summary.json so the framework excludes the run from result-reuse) and accepts --flag=value for all flags. Adds a Docker-daemon preflight that fails fast with a clear remediation message when the chosen sandbox backend isn't reachable.

scripts/report.ts: drops legacy un-suffixed result dirs (no more stray "?" rows) and pads columns so the matrix is readable in a terminal.
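The lockfile scrub mentioned above could look roughly like the sketch below. It assumes the lockfile uses the `packages` map with per-entry `resolved` URLs, and that the local Verdaccio origin is known to the caller; `scrubLockfile`, `PackageLock`, and the example origin are hypothetical names, not taken from pack.ts.

```typescript
// Hypothetical shape: only the fields this sketch touches.
interface LockPackage {
  version?: string;
  resolved?: string;
  [key: string]: unknown;
}

interface PackageLock {
  packages?: Record<string, LockPackage>;
  [key: string]: unknown;
}

const PUBLIC_REGISTRY = 'https://registry.npmjs.org';

/**
 * Returns a copy of the lockfile in which every `resolved` URL that starts
 * with the local registry origin points at the public registry instead.
 * Entries resolved from anywhere else are copied unchanged.
 */
function scrubLockfile(lock: PackageLock, localRegistry: string): PackageLock {
  const local = localRegistry.replace(/\/+$/, ''); // drop trailing slashes
  const packages: Record<string, LockPackage> = {};
  for (const [name, entry] of Object.entries(lock.packages ?? {})) {
    const resolved = entry.resolved;
    packages[name] =
      typeof resolved === 'string' && resolved.startsWith(local + '/')
        ? { ...entry, resolved: PUBLIC_REGISTRY + resolved.slice(local.length) }
        : { ...entry };
  }
  return { ...lock, packages };
}
```

Keeping the function pure (lockfile in, lockfile out) makes it easy to check in isolation before wiring it to the file system.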
# Conflicts: # pnpm-lock.yaml
PR mui#4761 ships docs/*.md inside @base-ui/react when built with BASE_UI_PUBLISH_DOCS=1. Rather than rebuild the package twice per variant, pack.ts now produces a single shared `.docs-overlay.tar` (staged from docs/public/**/*.md, mapped to node_modules/@base-ui/react/docs/), copied into every fixture alongside `.deps.tar`. The bundled-docs mechanism extracts the overlay onto the populated node_modules; the rehydrate wrapper cleans both tarballs up after.

Default model is now sonnet only (the matrix doesn't need to spend opus budget by default; opt in with `--model opus`). `--model` is the new flag name; `--models` is kept as a legacy alias.
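A minimal sketch of how the `--model` flag, its legacy `--models` alias, and the `--flag=value` form might be parsed. The function names and the alias table are assumptions for illustration; scripts/run.ts may be structured quite differently.

```typescript
type Flags = Record<string, string | true>;

// Legacy flag names mapped to their current spelling (--models -> --model).
const ALIASES = new Map<string, string>([['models', 'model']]);

/** Parses `--flag value`, `--flag=value`, and bare `--flag` into a map. */
function parseFlags(argv: readonly string[]): Flags {
  const flags: Flags = {};
  for (let i = 0; i < argv.length; i += 1) {
    const arg = argv[i];
    if (arg === undefined || !arg.startsWith('--')) continue;
    const eq = arg.indexOf('=');
    const rawName = eq === -1 ? arg.slice(2) : arg.slice(2, eq);
    const name = ALIASES.get(rawName) ?? rawName;
    if (eq !== -1) {
      flags[name] = arg.slice(eq + 1); // --flag=value
      continue;
    }
    const next = argv[i + 1];
    if (next !== undefined && !next.startsWith('--')) {
      flags[name] = next; // --flag value
      i += 1;
    } else {
      flags[name] = true; // bare --flag
    }
  }
  return flags;
}

/** Falls back to sonnet unless --model (or legacy --models) names a model. */
function resolveModel(flags: Flags): string {
  const value = flags['model'];
  return typeof value === 'string' ? value : 'sonnet';
}
```

Resolving the alias at parse time means the rest of the script only ever needs to look at `model`.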
Adds a discoverability ablation across four pointer locations. The
current `bundled-docs` arm is kept as the no-pointer control; four new
arms each layer a single docs-pointer on top:
- bundled-docs-readme overwrites node_modules/@base-ui/react/README.md
- bundled-docs-dts prepends a /** @packageDocumentation */ block to node_modules/@base-ui/react/index.d.ts
- bundled-docs-agents-md writes an AGENTS.md (+ CLAUDE.md shim) that tells the agent docs live under node_modules/@base-ui/react/docs/
- bundled-docs-skill same as above as a Claude Code skill
The published AGENTS.md / SKILL.md assets and packages/react/{README.md,
src/index.ts} are intentionally untouched — the pointers ship only as
mechanism-time injections so each arm measures the marginal value of
one specific hint.
Adds a new synthetic eval `docs-only-composition`: a fictional
`Combobox.RecentSearches` part whose placement rule (must nest inside
`Combobox.Empty`) lives only in the patched bundled docs. Types accept
it anywhere; the answer isn't in training data and isn't in types.
Existing arms should reliably misplace it; the docs+pointer arms
should compose it correctly.
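One way a deterministic grader could check the placement rule is sketched below. This is a string-level heuristic, not the grader in this PR: `recentSearchesInsideEmpty` is a hypothetical name, and the scan assumes no `>` appears inside attribute values (arrow functions in props would confuse it), which a real grader would avoid by using a JSX parser.

```typescript
/**
 * Heuristic: returns true when the source contains at least one
 * <Combobox.RecentSearches> and every one of them opens while a
 * <Combobox.Empty> element is still open.
 */
function recentSearchesInsideEmpty(source: string): boolean {
  const tag = /<(\/?)Combobox\.([A-Za-z]+)\b([^>]*)>/g;
  const open: string[] = []; // stack of currently open Combobox parts
  let found = false;
  let match: RegExpExecArray | null;
  while ((match = tag.exec(source)) !== null) {
    const closing = match[1] === '/';
    const part = match[2] ?? '';
    const selfClosing = (match[3] ?? '').trim().endsWith('/');
    if (closing) {
      const index = open.lastIndexOf(part);
      if (index !== -1) open.length = index; // pop back to the matching opener
      continue;
    }
    if (part === 'RecentSearches') {
      found = true;
      if (!open.includes('Empty')) return false;
    }
    if (!selfClosing) open.push(part);
  }
  return found;
}
```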
The pack pipeline learned per-eval docs overlays: the Patch interface
gained an optional `patchDocs(stagedDocsDir)` hook, scripts/pack.ts
stages the docs root as a directory at top-level and per-variant
copies+patches+tars when the patch opts in. Existing four fixtures
keep using the shared overlay tarball.
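The opt-in described above can be sketched as a small decision function. Only the optional `patchDocs(stagedDocsDir)` hook comes from the text; the `name` field, `OverlayPlan`, and `planDocsOverlay` are hypothetical and stand in for whatever scripts/pack.ts actually does.

```typescript
// Hypothetical shape; the source only names the optional patchDocs hook.
interface Patch {
  name: string;
  patchDocs?: (stagedDocsDir: string) => void | Promise<void>;
}

type OverlayPlan =
  | { kind: 'shared'; tarball: string }
  | { kind: 'per-variant'; copyFrom: string; patchDocs: NonNullable<Patch['patchDocs']> };

/**
 * Patches that opt in get their own copy of the staged docs to mutate and tar;
 * every other fixture reuses the shared overlay tarball.
 */
function planDocsOverlay(patch: Patch, stagedDocsDir: string, sharedTarball: string): OverlayPlan {
  return patch.patchDocs
    ? { kind: 'per-variant', copyFrom: stagedDocsDir, patchDocs: patch.patchDocs }
    : { kind: 'shared', tarball: sharedTarball };
}
```

Returning a plan instead of performing the copy keeps the branch testable without touching the file system.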
For new-prop, breaking-change, and new-component the patch mutated the installed package types but left the bundled docs stale. Doc-pointer arms (bundled-docs-readme/dts/agents-md/skill) then read docs that contradicted the patched API, conflating "agent finds the change via docs" with "agent recovers from misleading docs." Add a `patchDocs(stagedDocsDir)` hook to each: new-prop inserts a closeOnClear row + behavior section into combobox.md, breaking-change renames API tokens Clear -> Reset in combobox.md (CSS classes and aria-labels left intact), and new-component writes a synthetic callout.md. Also tidy prettier whitespace across previously-committed evals files.
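As an illustration of the breaking-change docs patch, the sketch below renames only the qualified API token and leaves class names and aria-label text alone. `renameClearToReset` is a hypothetical helper, and the real hook may cover more token forms than this one pattern.

```typescript
/**
 * Renames the qualified API token `Combobox.Clear` to `Combobox.Reset`.
 * Matching the qualified name with word boundaries leaves CSS class
 * references such as `styles.Clear` and aria-label text such as
 * "Clear selection" untouched (both are illustrative examples).
 */
function renameClearToReset(markdown: string): string {
  return markdown.replace(/\bCombobox\.Clear\b/g, 'Combobox.Reset');
}
```

For example, `<Combobox.Clear className={styles.Clear} aria-label="Clear selection" />` becomes `<Combobox.Reset className={styles.Clear} aria-label="Clear selection" />`.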
Sort matrix rows by mean pass rate descending so the strong arms cluster at the top. Add an "Age" column showing relative time since each row's latest run, and dim rows whose latest run is more than 24h older than the newest — old archive data shouldn't be compared apples-to-apples with fresh ablation runs. Colour pass-rate cells green/yellow/red on TTY; non-TTY output stays as plain markdown.
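The ordering and staleness rules can be sketched as a pure function. The `Row` shape, the age format, and all names here are assumptions for illustration; only the rules (sort by mean pass rate descending, dim rows more than 24h older than the newest) come from the text above.

```typescript
interface Row {
  mechanism: string;
  passRates: number[]; // one value in [0, 1] per scenario
  latestRunMs: number; // epoch milliseconds of the row's latest run
}

interface ReportRow extends Row {
  mean: number;
  age: string;
  dimmed: boolean;
}

const HOUR_MS = 60 * 60 * 1000;
const STALE_AFTER_MS = 24 * HOUR_MS;

function mean(values: number[]): number {
  return values.length === 0 ? 0 : values.reduce((sum, v) => sum + v, 0) / values.length;
}

/** Coarse relative age: "<1h", whole hours below a day, whole days after. */
function formatAge(ms: number): string {
  const hours = Math.floor(ms / HOUR_MS);
  if (hours < 1) return '<1h';
  return hours < 24 ? `${hours}h` : `${Math.floor(hours / 24)}d`;
}

function buildReport(rows: Row[], nowMs: number): ReportRow[] {
  const newest = Math.max(...rows.map((row) => row.latestRunMs));
  return rows
    .map((row) => ({
      ...row,
      mean: mean(row.passRates),
      age: formatAge(nowMs - row.latestRunMs),
      // Dim rows whose latest run is more than 24h older than the newest run.
      dimmed: newest - row.latestRunMs > STALE_AFTER_MS,
    }))
    .sort((a, b) => b.mean - a.mean);
}
```

Staleness is measured against the newest row rather than against the current time, so a report generated days after a full run does not dim every row.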
A new mechanism that ships docs as JSDoc preambles on the .d.ts files themselves rather than as a separate markdown overlay. Agents that already read types land on the prose without needing to discover a docs/ directory. Paired with a Next.js-style condensed AGENTS.md (now also used for the existing bundled-docs-agents-md arm) telling the agent the types are the source of truth.

Pack pipeline adds a per-eval .inline-dts-overlay.tar containing a copy of @base-ui/react with: a common anatomy JSDoc preamble prepended to combobox/index.parts.d.ts, plus patch-specific JSDoc bumps via a new optional patchInlineDts hook on the Patch interface. breaking-change renames the Clear → Reset reference in the anatomy preamble; docs-only-composition lands the RecentSearches placement rule on both the part's .d.ts and the anatomy preamble.

Also skip incomplete fixtures (no tsconfig.json) in the pack loop so stray scratch dirs don't break the bake.
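Prepending prose to a declaration file as a JSDoc block might look like the sketch below. `prependJsDoc` is a hypothetical helper, not the code in this PR; the one subtlety it illustrates is escaping a comment terminator inside the prose so the block cannot end early.

```typescript
/**
 * Prepends `prose` to `dts` as a JSDoc block comment.
 * Any comment terminator inside the prose is escaped, and trailing
 * whitespace on each emitted line is removed.
 */
function prependJsDoc(dts: string, prose: string): string {
  const body = prose
    .split('\n')
    .map((line) => ` * ${line.replace(/\*\//g, '*\\/')}`.replace(/\s+$/, ''))
    .join('\n');
  return `/**\n${body}\n */\n${dts}`;
}
```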
- Drop mechanisms we won't recommend: agents-md, bundled-docs-readme, mcp, skill. Removes assets + experiment files; mechanism union trimmed.
- Restructure inline-dts-agents-md: workspace AGENTS.md becomes a pointer; the real guidance (subpath imports, "components are top-level dirs") moves into a package-local AGENTS.md shipped under node_modules/@base-ui/react/, the directory the agent already walks while reading .d.ts files.
- Move EVAL.ts to a single canonical evals/_EVAL.ts. Each fixture's EVAL.ts is a byte-identical copy, kept in sync by scripts/sync-evals.ts (--check mode for CI). Per-fixture and shared idiomatic tests gate on test.skipIf.
- Add three shared idiomatic graders for the combobox family: function-child <Combobox.List>, items prop on Root, no Combobox.useFilteredItems. Catches "passes the old graders but isn't idiomatic" runs.
- Drop the strict subpath-import grader from all EVAL.ts files; npm run build already catches invalid paths and the preference is user-level, not a correctness check.
- Strip leaky "use Base UI's combobox/callout" hints from new-component and new-prop PROMPT.md.
- Add scripts/first-reads.ts: per-cell matrix showing the first file the agent reads under @base-ui/react. Surfaces which content channel each mechanism actually steers the agent into.
- Prefix Mechanism column in report with cc- so the value is copy-pasteable straight into pnpm eval cc-<name>.
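The three idiomatic checks could be approximated with string heuristics, as in the sketch below. This is an assumption about their shape, not the graders in evals/_EVAL.ts; `gradeIdiomatic` and `IdiomaticReport` are hypothetical names, and a real grader would more likely walk the AST (for instance, the `items` pattern here does not handle a generic `Root<Fruit>` opening tag).

```typescript
interface IdiomaticReport {
  functionChildList: boolean;
  itemsPropOnRoot: boolean;
  noUseFilteredItems: boolean;
}

/** String-level heuristics over the generated App.tsx source. */
function gradeIdiomatic(source: string): IdiomaticReport {
  return {
    // <Combobox.List ...> immediately followed by `{(args) =>` or `{arg =>`.
    functionChildList:
      /<Combobox\.List\b[^>]*>\s*\{\s*(\([^)]*\)|[A-Za-z_$][\w$]*)\s*=>/.test(source),
    // An `items=` attribute inside the <Combobox.Root ...> opening tag.
    itemsPropOnRoot: /<Combobox\.Root\b[^>]*\sitems=/.test(source),
    // The hook the graders want agents to avoid.
    noUseFilteredItems: !source.includes('Combobox.useFilteredItems'),
  };
}
```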
Bundle size
Performance
Total duration: 902.47 ms ▼-288.90 ms (-24.2%) | Renders: 51 (🔺+1) | Paint: 1,345.85 ms ▼-476.48 ms (-26.1%)
…and 4 more (+7 within noise)
Checkpoint of the agent-knowledge eval matrix at `test/evals/`. Six knowledge-injection mechanisms × five fixtures × Claude sonnet, each producing an `App.tsx` that gets graded both deterministically and qualitatively.

Note

- `AGENTS.md` pointing at prose docs is the strongest channel.
- … master.

How to read the eval

Each fixture asks the agent to build a small Base UI app under conditions the model was not trained on (a renamed part, a new prop, a brand-new component, prose-only composition). The "mechanism" controls what knowledge channel the sandbox exposes:

- `cc-baseline`: nothing extra; the model relies on its training and on reading `.d.ts` files in `node_modules`.
- `cc-bundled-docs`: `docs/react/components/*.md` shipped inside the package; no pointer.
- `cc-bundled-docs-agents-md`: bundled docs + a workspace `AGENTS.md` pointing at them.
- `cc-bundled-docs-dts`: bundled docs + a `@packageDocumentation` preamble on `index.d.ts`.
- `cc-bundled-docs-skill`: bundled docs + a `SKILL.md` advertised in the workspace.
- `cc-inline-dts-agents-md`: anatomy JSDoc inlined into the `.d.ts` files + a workspace `AGENTS.md` pointing at a package-local `AGENTS.md` shipped alongside the types.

LLM grading of `App.tsx` outputs

Letter grades reflect how close the output is to idiomatic 1.5 Combobox usage: subpath import, `items` prop on Root, function-child `<List>` for filtering, `<Portal>` layering, correct `Empty`/`List`/`RecentSearches` composition, and the per-fixture API the prompt expects.

`combobox`: basic fruit picker

Sanity check. A vanilla fruit picker using only APIs that exist in the model's training data; it measures baseline fluency with no post-cutoff surprises.

| Mechanism | Notes |
| --- | --- |
| `cc-baseline` | `FruitList` wrapper around `Combobox.useFilteredItems` (internal hook) and `.map`s items manually. Wrong API on every axis. |
| `cc-bundled-docs` | `<List>` ✓, generic `Root<Fruit>` ✓, but imports from `@base-ui/react` (no subpath). No `<Portal>`. |
| `cc-bundled-docs-agents-md` | `items` prop, function-child, `<Portal>`, `<ItemIndicator>`, `Empty` before `List`. Reads like a hand-tuned demo. |
| `cc-bundled-docs-dts` | … `Label` and self-closes `Clear`/`Trigger` with `aria-label`. |
| `cc-bundled-docs-skill` | `items` ✓, `<Portal>` ✓, but `.map` inside `<List>` instead of function-child. |
| `cc-inline-dts-agents-md` | … `skill`: structure is right, but it `.map`s items rather than using the function-child filtering API. |

`new-prop`: `<Combobox.Clear closeOnClear>`

Tests discovery of a synthetic post-cutoff `closeOnClear` prop on `<Combobox.Clear>`. The model can't have seen this prop; it has to find it in the injected channel.

| Mechanism | Notes |
| --- | --- |
| `cc-baseline` | `.map` inside `<List>`; missing `<Portal>`. Gets `closeOnClear` from training. |
| `cc-bundled-docs` | `filter={…}` that duplicates what the items API already does. |
| `cc-bundled-docs-agents-md` | `<Portal>`, `closeOnClear`, `<ItemIndicator>`. |
| `cc-bundled-docs-dts` | `closeOnClear` ✓, but imports from root. |
| `cc-bundled-docs-skill` | `Clear` is debatable but valid. |
| `cc-inline-dts-agents-md` | `closeOnClear` ✓, `<Portal>` ✓; still `.map`s instead of function-child. |

`breaking-change`: `Combobox.Clear` renamed to `Combobox.Reset`

A synthetic post-cutoff rename: `<Combobox.Clear>` no longer exists, replaced by `<Combobox.Reset>`. Tests whether the agent notices the part it remembers from training is gone.

| Mechanism | Notes |
| --- | --- |
| `cc-baseline` | … `Reset` and function-child, but threads `index` through manually. |
| `cc-bundled-docs` | |
| `cc-bundled-docs-agents-md` | |
| `cc-bundled-docs-dts` | … `Combobox.useFilter({sensitivity: 'base'})` and passes `filter={filter.contains}`; clever but overbuilt, and `.map`s the list. Reads the types and decides to maximize them. |
| `cc-bundled-docs-skill` | … `Reset`/`Trigger` with `aria-label`. |
| `cc-inline-dts-agents-md` | … `items` prop (relies on a `Root<Fruit>` generic alone) and `.map`s the static array, so the filtering API is bypassed entirely. |

`docs-only-composition`: `<Combobox.RecentSearches>` nested in `<Combobox.Empty>`

A synthetic post-cutoff `<Combobox.RecentSearches>` part whose composition rule (must nest inside `<Combobox.Empty>`) is documented in prose only. It is not derivable from the `.d.ts` shape, so it measures whether the mechanism surfaces docs the types can't.

| Mechanism | Notes |
| --- | --- |
| `cc-baseline` | … (`RecentSearches` inside `Empty`, `Empty` outside `List`) is correct. |
| `cc-bundled-docs` | … `<Portal>`. |
| `cc-bundled-docs-agents-md` | |
| `cc-bundled-docs-dts` | … `Combobox.useFilteredItems` with a `FruitItems` wrapper and nests `<Empty>` inside `<List>`: wrong layout. The `.d.ts` preamble doesn't talk about composition, so the agent invents structure. |
| `cc-bundled-docs-skill` | … `<Portal>`. |
| `cc-inline-dts-agents-md` | … `<Portal>`. |

`new-component`: Callout (component the model has no training data for)

A brand-new `Callout` component that didn't exist at training time. Pure discoverability test: can the agent find a component it has never heard of and use it correctly?

All six mechanisms discover Callout and import from `@base-ui/react/callout`. The differences are stylistic. Grades cluster around A−/A; `cc-baseline` lands first by walking `.d.ts` files alone, which is the most interesting result here: discovery works fine for a new component as long as it has a `.d.ts`.

Takeaways

- `cc-bundled-docs-agents-md` wins on every fixture. A short workspace `AGENTS.md` whose only job is to point at `docs/react/components/<X>.md` reliably routes the agent into prose docs first, and prose docs are where the idiomatic-usage patterns live (function-child `<List>`, composition rules, etc.).
- … (`bundled-docs-dts`, `inline-dts-agents-md`) overweight what types can express. They fix subpath imports cleanly via the `@packageDocumentation` preamble, but they push the agent toward hook-flavoured solutions (`useFilteredItems`, `useFilter`) and miss the function-child filtering pattern that only the prose docs teach.
- `cc-bundled-docs` ships the same markdown as the winner but, without an `AGENTS.md`, the agent sometimes wanders into `esm/*.js` or root `package.json` before finding it. The pointer is doing most of the work, not the content.
- `cc-baseline` finds `callout/index.d.ts` on the first try. Where the matrix actually separates mechanisms is idiomatic API usage on familiar components with recently-changed behaviour.

Reproducing