[test] Add axe-core tests for mui-material#48341
Conversation
Runs axe-core against real docs demos in Chromium, reusing the VRT Vite server. Enrollment is an explicit list in test/a11y/a11y.test.mjs (POC: Button + Card, 4 demos). Output a11y-results.json carries both per-component aggregates (for the docs widget) and per-demo detail (for future per-demo UI). Supersedes mui#47895 — swaps hand-written scenario modules for the docs demos themselves, per Olivier's suggestion to mirror Janpot's html-validate-over-docs-pages pattern in mui#48088.
Deploy previewhttps://deploy-preview-48341--material-ui.netlify.app/ Bundle size
Check out the code infra dashboard for more information about this PR. |
Replaces the hardcoded ENROLLED map in a11y.test.mjs with a full component roster at packages/mui-material/test/a11y/config.ts. Every component has an entry; status is 'enabled' (Button, Card) or 'pending' (rest). Adoption per PR = flip 'pending' to 'enabled' and fill in `demos`. The config doubles as the rollout ledger — a future docs presentation can iterate the full array to show both enrolled and pending components. Drops ComponentAccessibilityStatus.js and the two .md embeds. Presentation is a separate concern; the JSON shape stays (per-component aggregates + per-demo detail) so the future widget has the data it needs.
Replaces the single a11y-results.json (383 KB / 14k lines at full rollout)
with packages/mui-material/test/a11y/results/{Component}.json. A future
docs widget can import('.../results/Button.json') and webpack/vite will
code-split per component, so each page only ships its own ~2 KB payload
instead of the whole roster.
Also updates the docs:a11y script to rimraf the directory before each
run, and the AGENTS.md reference.
Flips 45 components from 'pending' to 'enabled' in config.ts. The 26 that currently trip color-contrast keep a skipRules: ['color-contrast'] entry so CI stays green while the violations still land in failedRules for the team to triage. a11y.test.mjs now falls back to the VRT nav when a config entry omits its 'demos' list, so enrolling is one line (flip the status). Explicit demos remain the override knob. The 11 components VRT excludes (need interaction, no demos, flaky) stay 'pending' with inline reasons — nothing observable today.
|
Rendering all the demos, one by one, in a real browser and run tests on them. Sounds an awful lot like the current screenshot tests we have. Would it make sense (be possible?) to fold accessibility testing into the regression tests. It would likely save a ton on circleci credits compared to running another full suite where we render all demos. |
`skipAssertions` better describes the behavior — the rule still runs and still lands in the results JSON, only the test-failing assertion is suppressed. `skipRules` read as "rule is skipped entirely", which isn't what happens. Also drops Box from the roster — nothing to a11y-test.
Layout wrappers (Grid, Masonry, Portal, NoSsr, Stack, Paper) and pure behavior components (Backdrop, ClickAwayListener, Container, CssBaseline) have no a11y semantics of their own — axe either finds nothing or attributes violations to them that actually belong to their inner content (already covered by the child component's own entry). Keeps Typography (heading hierarchy), Skeleton (aria-busy), Divider (caught a real list violation) since those contribute real signal. 44 components in roster → 36 enabled, 8 pending (interaction-only).
The status gate was infrastructure for a presentation UI that isn't in
scope for this PR. The config now lists only components that are actually
enrolled; pending ones live as `// TODO: {Component} — <blocker>` comments
so the reason stays visible without needing a parseable data shape.
If a future docs widget ever wants a "coming soon" list, convert the
comments back to structured entries then.
Infra is already shared — Kept it as a separate Vitest file for a standalone dev loop, not because of the infra cost. Happy to fold axe into Do you think saving around ~40s is a hard constraint? Happy to follow your judgement. |
It'd cost about 8 euro per month, so cost-wise probably not. It'd increase the current job time (6m 12s/last 30 days) to ~7m, making it the second longest running job, after the type checking. That technically doesn't make it the bottle neck. The bottle neck currently is type checking, we plan to migrate to typescript native soon, so in not so long the |
|
I pushed a commit to move to regression test, the timing does not differ much but the code is simplified. Net CI wall-clock saving is ~break-even; real wins are ~16 s of duplicate Chromium setup avoided per run and infra simplification (one Vitest suite, one config, one script deleted). No new violations; 2 of 38 result files differ (all pre-existing false-positive |
Per Jan's review on mui#48341: run axe inline in test/regressions/index.test.js, reuse the same Chromium session. Drop the standalone test/a11y/ suite and the docs:a11y:dev script. CI a11y gate moves from test_browser to test_regressions. Slider/Table results reflect a cleaner DOM (mouse parked, scrollbar normalized); all deltas are false-positive color-contrast flags disappearing.
|
Ok, if the win in runtime is marginal then we can go either way, your call. Did you check how currently screenshots are being excluded? Because if you implement this as enrollment in the existing screenshots fixtures, you will also inherit its exclusion mechanism. If we go the combined route then we should probably import all demos and build separate exclusion mechanisms for a11y and screenshots. e.g. declare const demoMeta: Map<string, {
screenshot: {
enabled: boolean,
// extra options, e.g. waitForSelector...
},
a11y: {
enabled: boolean,
rules: {
// ...
}
}
}>()aside: Thanks for editing the LLM generated comment, it really demotivates me when I have to answer these. Feels like I'm putting in the effort the author should have done when interacting with their LLM. |
There was a problem hiding this comment.
Just two things,
- My edited comment in #48341 (comment) about screenshot exclusion already being encoded in how we load the test fixtures, is it a problem?
- We probably want to generate the data nested under the ./docs folder somewhere, just like we do for other docs generated data?
Sorry for that, I usually don't want LLM to write response directly on GH but it was trying to be smart and push the comment directly 🥲. I'm building guard rail around it, bear with me. |
Per-tool exclusions now live in test/regressions/demoMeta.ts instead of glob negations in index.jsx. Screenshot-specific reasons (Redundant, Flaky image loading) no longer silently drop a11y coverage; autocomplete Redundant variants, skeleton/Facebook, snackbars/Customized, steppers image-heavy demos, and AccessibleTabs1/2 now get axe coverage. Glob keeps only structural exclusions + whole slugs no tool consumes. DEMO_META accepts both demo-level and slug-level keys; demo wins when both exist. shouldScreenshot + resolveA11y are the test-runner gates. Unit tests in demoMeta.test.ts cover current shape + the 3-line workflow for enabling a11y on an interaction-heavy slug (e.g. dialogs). a11yConfig.ts removed; AGENTS.md updated.
|
Pushed f337bc7 implementing the separation. The regression exclusion can be separately configurable in |
Colocates the generated JSON with other docs data, per Janpot's second review point. Reporter's OUT_DIR and the test:regressions prettier step updated to the new path. No other consumers touched the old path.
Migration expanded axe coverage to demos previously screenshot-excluded (autocomplete Redundant, skeleton/Facebook, etc.). Surfaces color-contrast near-threshold findings on fab, skeleton, bottom-navigation — same class of issue already suppressed elsewhere. Added skipAssertions for the three slugs and regenerated results JSONs at docs/data/material/a11y/.
just the infra (with couple of Button demos generated). |
Address Codex review on mui#48341: waitForSelector field on ScreenshotRule was declared but never read. Migrate the hardcoded ReactVirtualizedTable wait into a rule and apply screenshotRule.waitForSelector before axe + screenshot in the VRT loop. Reporter now prunes stale {slug}.a11y.json after writing — gated on the VRT module having actually run and no -t filter, so unrelated unit-test runs and filtered slug subsets leave untouched files alone.
nice catch, fixed both of them. |
| } | ||
| } | ||
|
|
||
| const violations = [...new Set([...results.violations, ...results.incomplete].map((v) => v.id))]; |
There was a problem hiding this comment.
I don't think incomplete should be treated as (merged into) violations here
It generally means axe-core can't evaluate this and "needs human review": https://docs.deque.com/devtools-for-web/4/en/needs-review-incomplete
IMO these are the most important ones to track as they are the automation gaps
There was a problem hiding this comment.
True, fixed. the generated results contains a new field needsReview for maintainer.
| // VRT module must have actually executed, and no `-t` filter narrowed it. | ||
| // Anything else (unrelated unit tests, filtered slug subset) leaves | ||
| // untouched slugs' files alone since we have no signal about them. | ||
| const ranVrtSuite = testModules.some((m) => m.moduleId === VRT_MODULE_PATH); |
There was a problem hiding this comment.
Codex finding:
This prunes when the VRT module appears and no
-tfilter is set, but it ignores Vitest’sreasonargument ("passed" | "interrupted" | "failed"). This can delete unrelated*.a11y.jsonfiles after a partial or failed local run. I’d fix this before broad rollout, but I’d call it non-blocking for the current PR. Best fix: accept the thirdonTestRunEndarg and only prune whenreason === 'passed'.
| const violations = [...new Set(results.violations.map((v) => v.id))]; | ||
| const needsReview = [...new Set(results.incomplete.map((v) => v.id))].filter( | ||
| (id) => !violations.includes(id), | ||
| ); |
There was a problem hiding this comment.
nit: having violationsSet and then being used in the needsReview's filter is faster. up to you if you think it's worth optimizing.
| const slugs = [...bySlug.keys()].sort(); | ||
| const partial = slugs.filter((s) => bySlug.get(s)!.some((m) => m.violations.length > 0)); | ||
| const needsReview = slugs.filter( | ||
| (s) => !partial.includes(s) && bySlug.get(s)!.some((m) => m.needsReview.length > 0), | ||
| ); | ||
| const pass = slugs.filter((s) => !partial.includes(s) && !needsReview.includes(s)); |
There was a problem hiding this comment.
same speed suggestion here, I would probably use sets to speed up all these including checks. we don't need them as arrays anyway here.
There was a problem hiding this comment.
got it but the gain is sub-millisecond, so I'd leave it like this for now.
mj12albert
left a comment
There was a problem hiding this comment.
Looks good ~
I pushed a commit to update the docs.
One follow up is to consider whether to surface regressions on a previously passed (and committed) non-visual axe rule in a CI step. Right now such a change could be committed and be visible in a PR diff but fully pass CI.
c2d3738 to
8e6e714
Compare
8e6e714 to
5d56eef
Compare
closes #47887 · supersedes #47895
Summary
Automated
axe-coreaccessibility testing for@mui/material, driven off the real docs demos. Folded into the existing visual-regression Playwright loop (test/regressions/index.test.js) — for each enrolled demo, axe runs against the rendered[data-testid="testcase"]element in the same browser session that takes the screenshot, and the result is written todocs/data/material/components/{slug}/{slug}.a11y.json(one file per slug, keyed by demo name).Initial scope is
buttonsonly —BasicButtonsandColorButtons— to validate the pipeline end-to-end before broader enrolment in follow-ups. Layout wrappers (Box,Grid,Stack,Paper,Container,Portal, …) will be excluded on principle when added: they contribute no a11y signal of their own; any violation found on them really belongs to their children, already covered by the child's own enrolment.Config shape
Two independent rule arrays — one per tool — evaluated last-match-wins (no inheritance: an override rule must restate every field it cares about) against the docs path
docs/data/material/components/{slug}/{Demo}(minimatch globs). Independent arrays mean editing one tool's config can't stomp the other.pnpm test:regressionsrefreshes the*.a11y.jsonfiles. CI enforces they're up to date via a git-diff check.For Reviewers
Why demo-based, not scenario-based
#47895 introduced scenario files (
test/a11y/scenarios/Button.a11y.tsx) that hand-authored every axe case per component. Per Olivier's feedback, docs demos already cover every real usage scenario — duplicating them as test fixtures is waste. This PR replaces the scenarios folder with a thin rule list pointed at the demo routes VRT already serves.Why folded into the VRT loop
@Janpot's earlier concern about duplicate Playwright sessions: rather than a parallel browser, axe now runs inside the existing screenshot loop. Each enrolled demo gets one
axe.runagainst the already-rendered[data-testid="testcase"]element before the screenshot is taken. Failure isolation is preserved by recording the result onctx.task.meta.a11yfirst; the assertion fires only on visual rules (color-contrast,link-in-text-block) that depend on real CSS.Screenshot and a11y opt-outs are independent — a demo screenshot-disabled for being redundant or interaction-heavy can still be audited by axe.
Config follow-up from review
@Janpot's review comment flagged the original dual-Map structure (
SLUG_A11Y+DEMO_META) as non-intuitive and at risk of cross-tool stomping. Replaced with two rule arrays + minimatch globs as suggested; per-demo opt-outs are appended after slug-wide rules; the override rule restates every field it wants (no implicit inheritance).Output shape
One file per enrolled slug at
docs/data/material/components/{slug}/{slug}.a11y.json, co-located with the slug's other docs files. Each file is an object keyed by demo name:{ "BasicButtons": { "passedRules": [...], "failedRules": [], "testedRules": { "wcag2a": [...], "wcag2aa": [...], "wcag22aa": [...] } }, "ColorButtons": { ... } }Co-locating with
{slug}.mdand the demo source files makes the data discoverable from the slug folder and avoids the slug-prefix collision problem of a flat per-demo layout.Code
test/regressions/demoMeta.ts— two rule arrays + resolvers (shouldScreenshot,resolveA11y).test/regressions/a11y/axe.ts—recordA11yrecords ontoctx.task.meta.a11y, asserts visual rules only.skipAssertionssuppresses the assertion for listed rule IDs without dropping them from recorded data — failing rules still land infailedRulesfor triage.test/regressions/a11y/a11yReporter.ts— Vitest reporter that groups results by slug and writes one file per slug todocs/data/material/components/{slug}/{slug}.a11y.json.test/regressions/index.test.js— callsrecordA11yinside the existing per-route loop, before the screenshot.test/regressions/demoMeta.test.ts— 8 unit tests for the resolvers (precedence, brace-glob enrolment, screenshot vs a11y independence).CI
The existing
test:regressionsjob already builds, serves, and runs Playwright. Adding axe is a no-op on infra. The git-diff guard ensures every*.a11y.jsonis up-to-date.Agent docs
AGENTS.md— "Accessibility Testing" section covers the rule-array shape, brace-glob narrowing, per-demo opt-outs, and theskipAssertionsescape hatch.Not in scope (follow-ups)
*.a11y.jsonvia the docs render pipeline and shows pass/fail counts plus a per-WCAG-tag rule breakdown on click.A11Y_RULES.skipAssertionslists it. Fine for a POC; may want finer control once more components enroll.