feat(apply): apply-time safety gates — declared-resource existence + drift preview/verify#173
Conversation
📝 WalkthroughWalkthroughThis PR implements apply-time and upgrade-time safety gates for the talm CLI, including host-resource existence validation, configuration drift preview, post-apply state verification, and post-upgrade version verification. It adds companion CLI skip flags and integrates reference extraction, disk selector matching, and multi-document YAML diffing infrastructure to support these phases. Additionally, it fixes the non-TTY issue with ChangesApply-time safety gates and init improvements
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Poem
✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
|
There was a problem hiding this comment.
Code Review
This pull request introduces the applycheck package to implement safety gates for Talos MachineConfig applications, including host-resource reference validation and structural diffing. These checks are integrated into the apply command with new flags to control resource validation, drift previews, and post-apply verification. A review comment identifies an improvement opportunity in error handling within cosiMachineConfigReader to prevent transient errors from being incorrectly reported as maintenance connection issues.
| if err != nil { | ||
| // PermissionDenied on the insecure path is the documented | ||
| // Sensitive-resource degrade signal. Distinguishing transient | ||
| // failures from auth-side denial would require parsing the | ||
| // gRPC status; for the pre-merge cut we treat any read | ||
| // failure as "could not read" and let the caller decide | ||
| // whether to surface that as a blocker (Phase 2B never does; | ||
| // Phase 2A never does either). | ||
| return nil, false, nil | ||
| } |
There was a problem hiding this comment.
The current implementation swallows the error from safe.StateGet. This can lead to misleading output in previewDrift and verifyAppliedState, where a transient network or API error is reported as a maintenance connection issue. It's better to return the error so the caller can provide accurate feedback.
| if err != nil { | |
| // PermissionDenied on the insecure path is the documented | |
| // Sensitive-resource degrade signal. Distinguishing transient | |
| // failures from auth-side denial would require parsing the | |
| // gRPC status; for the pre-merge cut we treat any read | |
| // failure as "could not read" and let the caller decide | |
| // whether to surface that as a blocker (Phase 2B never does; | |
| // Phase 2A never does either). | |
| return nil, false, nil | |
| } | |
| if err != nil { | |
| // PermissionDenied on the insecure path is the documented | |
| // Sensitive-resource degrade signal. Distinguishing transient | |
| // failures from auth-side denial would require parsing the | |
| // gRPC status; for the pre-merge cut we treat any read | |
| // failure as "could not read" and let the caller decide | |
| // whether to surface that as a blocker (Phase 2B never does; | |
| // Phase 2A never does either). | |
| return nil, false, err | |
| } |
Capture the matrix of cases the gates were validated against during PR #173 development: Phase 1 link / disk / selector inputs, Phase 2A diff classification, Phase 2B mode gating, real-Talos validation steps against the dev17 reference cluster. Documents the known limitations (upgrade has no gates, insecure path can't render with discovery, --init --update non-tty UX gap). Future-self reference for re-validation after touching applycheck or the preflight hooks. Refs: #172 Signed-off-by: Aleksei Sviridkin <f@lex.la>
Adds docs/manual-test-plan.md, a checklist for exercising talm against a real Talos cluster. Complementary to the narrow apply-safety-gates plan: it walks the whole CLI surface (init, template, apply, upgrade, rotate-ca, meta, reset, completion, read-only diagnostics) with expected outcomes and the failure modes to watch for. Each section captures the exact commands run during the dev17 adversarial sweep that produced the eight bugs PR #173 fixes, plus the destructive sanity-check block to run after reset / rotate-ca / mode=reboot to confirm cluster health. Extends naturally — new sections drop in alphabetically (K, L, ...). Also adds nit.md to .gitignore so the in-flight notes file doesn't get accidentally committed. Refs: #172 Signed-off-by: Aleksei Sviridkin <f@lex.la>
…kip + linksDisksReader three-valued + md reflow Blockers from branch-review pass 2 (#173): 1. Phase 2C ran WithClient even on --insecure upgrade. The maintenance / pre-auth connection cannot reach the auth-only COSI path; pre-fix the gate either silently surrendered with a 'version unreadable' line or, worse, connected to an unrelated node from the talosconfig context and verified its state. Extract shouldRunPostUpgradeVerify(insecure, staged, skip) pure predicate; capture --insecure flag BEFORE original RunE (alongside --image and --stage). Mirrors cosiMachineConfigReader's insecure-path branch in apply. 2. Phase 2C ran after --stage upgrades — guaranteed false positive blocker since the new partition isn't activated until the next reboot and runtime.Version still reports the OLD version. Same predicate now skips on --stage, mirroring shouldRunPostApplyVerify's STAGED case for apply. 3. docs/manual-test-plan.md K2-pre block claimed Phase 2C 'until it lands', contradicting the new K1-pre section right above and the README addition. Reworked K2-pre as the manual fallback for --skip-post-upgrade-verify / --insecure flows that the gate skips by design. K1-pre's expected output is synced with the actual emit (two-hypothesis hint, waiting line, skip list). 4. cosiLinksDisksReader had a two-valued (snapshot, ok) signature that conflated transient COSI errors with auth-disallowed. Mirror machineConfigReader's three-valued (snapshot, ok, err) shape: - err != nil -> transient (apid timeout, network blip); surface underlying cause, the operator's config isn't wrong. - !ok && err==nil -> auth-disallowed / resource unreachable by design; suggest --skip-resource-validation. - ok -> success. preflightValidateResources splits the two branches into distinct error messages (no more misleading 'config is wrong' on a 2s COSI timeout). New regression test TestPreflightValidateResources_TransientErr_WrapsUnderlying pins the contract. 5. Phase 2C silent no-op when both --nodes and talosconfig context's Nodes are empty. Now prints an explanatory line so the operator sees the gate they opted into didn't actually run. Recommendations addressed: - Phase 2C 90s wait now prints a 'waiting...' line up front so the operator's terminal isn't a mystery hang. - targetImage == '' path prints a skip notice instead of silent no-op. - transportNVMe const split into transportNVMe (COSI Disk.Transport value) + selectorTypeNVMe (v1alpha1 InstallDiskSelector.type enum). They're the same string today; documenting the contract separation prevents a silent future regression. - Phase 2C error message reworded to lead with both hypotheses ('auto-rolled back OR still booting'), matching the two- hypothesis hint constant. Documentation reflow (separate, no behaviour change): user pushed back on hardwrapped prose in human-rendered .md and PR/issue bodies. Reflowed docs/manual-test-plan.md, docs/apply-safety-gates-test-plan.md, nit.md, GitHub issue #178, and PR #173 body to one-line-per-paragraph per CLAUDE.md's prose rule. Tables, code blocks, and bullet structure preserved verbatim. Recommendation NOT addressed in this pass: the parentDir-shadowing-the-package-constant nit. Pre-existing across ~6 sites; folding it into this PR adds noise. Worth a follow-up issue (kind/cleanup, area/commands, good first issue). Refs: #172, #175 Signed-off-by: Aleksei Sviridkin <f@lex.la>
69c4d5d to
558f680
Compare
Adds four apply-time safety gates to talm (umbrella #172) plus the talm init --update non-tty UX fix from #174. The Phase 2C gate also closes #175 by detecting silent Talos auto-rollback after a failed cross-vendor upgrade. # Phase 1 — declared resource existence Pre-apply walker that scans the rendered MachineConfig for declared links and disks, then asserts each one resolves against the node's live COSI snapshot. Catches typos before they hit talosctl. Walker coverage: v1.12 multi-doc form (validates references to existing links): - LinkConfig.name - DHCPv4Config.name, DHCPv6Config.name, EthernetConfig.name - BondConfig.links[], BridgeConfig.links[] - VLANConfig.parent - Layer2VIPConfig.link, HCloudVIPConfig.link v1.12 multi-doc form (validates disk references): - UserVolumeConfig.provisioning.diskSelector v1.11 nested form: - machine.network.interfaces[].interface (link refs) - machine.install.disk (literal path) - machine.install.diskSelector (size/model/serial/wwid/type /busPath filter) - machine.disks[].device Selector semantics mirror Talos's v1alpha1_provider.go disk resolution: humanize.ParseBytes for size, virtual / readonly / CDROM exclusion, by-id / by-path stable symlinks accepted. The type-selector enum (nvme / sd / hdd / ssd) maps to (Transport, Rotational) per Talos; case handling is intentionally permissive (predicateType lowercases) — Phase 1 catches declared-resource mismatches, not Talos's input-format strictness. Bond / Bridge / VLAN / Wireguard / Dummy / LinkAlias config kinds are intentionally excluded from name-emission: their .name describes a NEW resource being created, not a reference to an existing one. The walker's multidocHandlers dispatch table distinguishes the two classes explicitly. Walker is YAML-only — no Talos machinery types in its surface — so contract tests in pkg/applycheck/refs_test.go run without standing up a live cluster. Real Talos YAML keys (BridgeConfig. links not ports, VLANConfig.parent not link) are pinned to test so a future Talos schema rename surfaces as a red test. Skip via --skip-resource-validation. Default on. Hint candidate lists are capped at 10 entries with a "... and N more" suffix so cozystack hosts with many dm/drbd/loop devices don't drown the actual finding line. # Phase 2A — pre-apply drift preview Reads the node's current MachineConfig via COSI and emits a +/-/~/= diff against the rendered config that's about to be applied. Informational — never blocks. Runs on every apply mode including --dry-run. Diff is doc-structural, keyed by (Kind, Name) for v1.12 multi-doc and synthetic MachineConfig{} for v1.11 root. Re-serialization order is irrelevant (TestDiff_KeyOrderIrrelevant). Slice values are surfaced as multiset add/remove buckets ("removed [127.0.0.1]" for certSANs duplicate-cleanup) rather than full-slice dumps; map / nested values render in YAML flow style ({k: v, k2: v2}) rather than Go's map[k:v k2:v2]. Empty nested maps emit as their own leaf so an absent / present transition surfaces visibly. Duplicate (Kind, Name) in a single doc stream is a config defect (Talos's behaviour on duplicates is unspecified) — parseDocs surfaces it as an error so the operator sees the real problem instead of a misleading drift line. Multi-node output is per-node-distinguished: the header includes "(node X.Y.Z.W)" when nodeID is non-empty so an operator scanning stderr from a multi-node apply can tell whose diff is whose. The auth-template path threads the node through cosiPreflightContext, which downshifts client.WithNodes (plural, single-element — needed for template lookup round-tripping) to client.WithNode (singular — needed because apid's COSI router rejects the plural metadata key); multi-element ctx is refused up front. On insecure / maintenance connection the MachineConfig resource is Sensitive and unreachable, so the gate degrades gracefully with a one-line "drift verification unavailable on maintenance connection" warning. Skip via --skip-drift-preview. Default on. # Phase 2B — post-apply state verification After ApplyConfiguration returns success, re-reads the on-node MachineConfig and structurally compares against the bytes that were sent. Divergence blocks with a per-document diff, catching silent doc drops (Talos parser ignored an unknown field) and controller reverts. Mode skips: - REBOOT / STAGED / TRY — verify cannot reach a stable ActiveID; pinned by shouldRunPostApplyVerify matrix. - AUTO — Talos's apply-server promotes AUTO to REBOOT internally when the change requires it, so the verify would race the reboot. Conservative skip; operators wanting verify on no-reboot AUTO applies pass --mode=no-reboot explicitly. - --dry-run — no apply happened. Reader err != nil is a hard blocker (the gate exists to catch silent rollbacks; swallowing a read error would defeat its purpose). ok=false err=nil on the auth path degrades to a maintenance-connection warning. Skip via --skip-post-apply-verify. Default OFF until the Talos- mutated-field allowlist lands (#172) — Talos mutates cert hashes and timestamps post-apply that surface as false-positive divergence today. # Phase 2C — post-upgrade version verify After talosctl upgrade returns success, waits 90s for the node to finish booting then reads runtime.Version via COSI and compares the running version's (Major, Minor) contract against the contract parsed from the target image tag. Cross-minor mismatch surfaces as a hint-bearing blocker with a two- hypothesis hint (silent A/B rollback after the new partition failed its boot readiness check, OR the node is slower than the reconcile window). Closes #175: cross-vendor upgrade silent no-op. The upgrade RPC acks success while Talos has actually rolled back to the previous partition (cross-vendor image, missing extensions, failed boot readiness check). The gate catches this by reading runtime.Version on the auth path post-boot. Target image is captured BEFORE talosctl's RunE — talosctl's upgrade flow can overwrite --image with the node's running install.image (no-op upgrade), which would mask the version mismatch the gate exists to catch. Mode skips: - --insecure — auth-only COSI path is unreachable; hard early-return in shouldRunPostUpgradeVerify. - --stage — new partition not yet activated until reboot; runtime.Version would always report the old version. Best-effort surrender on digest-pinned images and unparseable tags (the gate can't extract a version literal). Reader err is the rollback signal — a node that hung mid-boot looks like "connection refused" from COSI; surfaces as a hint-bearing blocker (TestVerifyPostUpgradeVersion_ReaderConnectionRefused_ NotSilent). versionReader signature is three-valued (string, bool, error) to distinguish real read failures from by-design unreachable cases. Multi-node verify is collect-then-block: every node's verdict is captured (errors.Join) rather than short-circuiting on the first failure. talosctl upgrade has already executed on every node by the time this loop runs, so stopping at the first failure would hide partial rollbacks. The 90s reconcile wait is deferred until after node resolution so an operator with an empty --nodes list doesn't sit through 90s of "waiting for the node to finish booting..." just to be told there was no target node. Skip via --skip-post-upgrade-verify. Default on. # init --update non-tty handling (#174) Closes #174. Previously, talm init --update under a non-tty stdin (CI, scripted) hung on an interactive overwrite prompt that EOF'd into a cryptic error. Now: - resolveOverwritePolicy(force, isTTY) is a pure predicate returning Ask / Force / NonInteractive. Non-tty without --force emits a hint-bearing error pointing the operator at --force. - stdinIsTTY uses term.IsTerminal instead of the naive ModeCharDevice check that falsely accepted /dev/null as a tty (and produced the original EOF wedge). - init --image runs validateImageRefShape before writing, rejecting malformed --image values (e.g. "::malformed", "no-slash:tag") up front instead of leaving a broken values.yaml that fails later inside configloader. # Implementation notes Engine package is unchanged. All gate logic lives in pkg/applycheck (walker, validator, differ, selector) and pkg/commands/preflight_apply_safety.go (Phase 1/2A/2B hooks + COSI readers) plus pkg/commands/preflight_upgrade_verify.go (Phase 2C). The three readers (linksDisksReader, machineConfigReader, versionReader) share the (snapshot, ok, err) three-valued contract for test-fake symmetry, though the policy on each output combination differs per gate. cosiLinksDisksReader gives each ListAll call its own context.WithTimeout budget — a shared budget would leak the first read's elapsed time into the second's deadline and produce false transient-timeout blockers on slow nodes. multisetDiff's equality key is fmt.Sprintf("%v", item) wrapped behind canonicalKey() — fmt sorts map keys for stable output (documented in the fmt package), and the wrapper isolates the contract for a future swap. Pinned by a 100-iteration stability test. # Documentation - README's "Apply-time safety gates" section describes all four gates with their skip-flag defaults. - docs/apply-safety-gates-test-plan.md — narrow QA reference for the gates: matrix of mode × skip-flag × insecure × --dry-run for every gate, real-Talos validation steps. - docs/manual-test-plan.md — broader end-to-end QA matrix covering the whole talm CLI surface. - golangci-lint timeout bumped to 10m in pr.yml to accommodate the additional package's analysis time. Test surface includes red+green pairs for every contract-level predicate (TDD per project rule) plus integration tests for the preflight hooks via function-type fakes. Walker, differ, validator, selector each have their own contract test file. TestNoWorkflowLeakageInRepoSource is extended to ban private operator identifiers from committed source. Closes #172 Closes #174 Closes #175 Assisted-By: Claude <noreply@anthropic.com> Signed-off-by: Aleksei Sviridkin <f@lex.la>
558f680 to
7868946
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/commands/init.go`:
- Around line 1089-1092: The current error wrapping around
askUserOverwrite(filePath, policy) replaces the original error (e.g.,
overwritePolicyNonInteractive) with "failed to read user input", which is
misleading for non-interactive flows; update the error handling in the caller of
askUserOverwrite to preserve the original error context: detect or allow the
sentinel error returned by askUserOverwrite (such as
overwritePolicyNonInteractive) to be returned directly (or wrapped with a
message that includes the original error), and only use "failed to read user
input" for genuine stdin read failures; reference askUserOverwrite,
overwritePolicyNonInteractive, filePath and policy when making this change so
the non-tty path propagates its true error.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 6f332d6f-b787-4a4f-a1c5-c4ba2e2968ba
📒 Files selected for processing (27)
.github/workflows/pr.ymlREADME.mddocs/apply-safety-gates-test-plan.mddocs/manual-test-plan.mdgo.modpkg/applycheck/diff.gopkg/applycheck/diff_test.gopkg/applycheck/doc.gopkg/applycheck/refs.gopkg/applycheck/refs_test.gopkg/applycheck/selector.gopkg/applycheck/validate.gopkg/applycheck/validate_test.gopkg/commands/apply.gopkg/commands/apply_test.gopkg/commands/init.gopkg/commands/init_overwrite_test.gopkg/commands/init_test.gopkg/commands/preflight.gopkg/commands/preflight_apply_safety.gopkg/commands/preflight_apply_safety_test.gopkg/commands/preflight_test.gopkg/commands/preflight_upgrade_verify.gopkg/commands/preflight_upgrade_verify_test.gopkg/commands/upgrade_handler.gopkg/commands/upgrade_handler_test.gopkg/engine/engine_test.go
| overwrite, err := askUserOverwrite(filePath, policy) | ||
| if err != nil { | ||
| return errors.Wrap(err, "failed to read user input") | ||
| } |
There was a problem hiding this comment.
Preserve policy error context instead of forcing a “read input” message.
When overwritePolicyNonInteractive is hit, no stdin read occurs, but Line 1091 rewrites that path to "failed to read user input". This makes non-tty failures misleading.
Suggested fix
overwrite, err := askUserOverwrite(filePath, policy)
if err != nil {
- return errors.Wrap(err, "failed to read user input")
+ return err
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| overwrite, err := askUserOverwrite(filePath, policy) | |
| if err != nil { | |
| return errors.Wrap(err, "failed to read user input") | |
| } | |
| overwrite, err := askUserOverwrite(filePath, policy) | |
| if err != nil { | |
| return err | |
| } |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@pkg/commands/init.go` around lines 1089 - 1092, The current error wrapping
around askUserOverwrite(filePath, policy) replaces the original error (e.g.,
overwritePolicyNonInteractive) with "failed to read user input", which is
misleading for non-interactive flows; update the error handling in the caller of
askUserOverwrite to preserve the original error context: detect or allow the
sentinel error returned by askUserOverwrite (such as
overwritePolicyNonInteractive) to be returned directly (or wrapped with a
message that includes the original error), and only use "failed to read user
input" for genuine stdin read failures; reference askUserOverwrite,
overwritePolicyNonInteractive, filePath and policy when making this change so
the non-tty path propagates its true error.
Four operator-UX polish changes on the safety surfaces introduced by #173. * #191 — Per-node prefix on the maintenance-connection warning. Multi-node insecure apply used to print identical bare 'talm: drift verification unavailable on maintenance connection' lines for every node with no per-node correlation. Both the previewDrift and verifyAppliedState emission sites now wrap through nodePrefix(nodeID); single-node case (empty nodeID) keeps the bare line so the common UX is unchanged. cosiPreflightContext got a fallback path to GlobalArgs.Nodes[0] when no outgoing-context metadata is attached — the maintenance flow (openClientPerNodeMaintenance) pins the node in GlobalArgs but does not attach context metadata, so without the fallback the per-node prefix collapsed to empty even when --nodes was set explicitly. Fallback is single-element-only; multi-element GlobalArgs.Nodes returns empty so a broken caller surfaces as 'no prefix' rather than 'wrong node prefix'. * #190 — Configurable post-upgrade reconcile window. The hardcoded 90s wait after talosctl upgrade returns is now --post-upgrade-reconcile-window. Default defaultPostUpgradeReconcileWindow = 90 * time.Second preserves byte-identical back-compat. validatePostUpgradeReconcileWindow rejects non-positive values at the TOP of wrapUpgradeCommand RunE — fail-fast BEFORE any talosctl RPC fires, so an operator's '=0s' typo cannot land a partial upgrade. The version-mismatch hint copy and error body de-hardcode the '90s' literal to reference 'the configured reconcile window' (defaultPostUpgradeReconcileWindow flows through verifyPostUpgradeVersion). * #192 — Syntactic net-addr walker for three v1alpha1 multidoc kinds. Validates fields the actual Talos schema emits: * StaticHostConfig.name (IP literal; the `name` field on this kind doubles as the IP — there is no separate `address` field). * NetworkRuleConfig.ingress[].subnet and .except (CIDR shapes; no top-level matchSourceAddress[] field exists in Talos's schema). * WireguardConfig.peers[].endpoint (host:port; empty / absent endpoint is a listener-only peer and NOT a finding). Walker runs in parallel with the Ref-based walker via multidocNetAddrHandlers; dispatch-table disjointness with multidocHandlers is pinned by a unit test. Integration tests verify Phase 1 blocks on bad and passes on valid. * #189 — Drift preview redacts secret-bearing field values by default. secretFieldPaths allowlist with bracket-normalisation regex; secret check fires BEFORE bothSlices in formatFieldChangeLine so slice-shaped allowlist entries (cluster.acceptedCAs, machine.acceptedCAs, peers) never leak via formatSliceSetDiff. Length-disclosing redaction sentinel (***redacted (len=N)***) preserves rotation signal without leaking the value. Non-string secret values route through redactValue(fmt.Sprintf("%v", value)) so int / bool rotations still surface as different-length sentinels. Map iteration order non-determinism is disclaimed in the godoc (no map-shaped allowlist entry today). New flag --show-secrets-in-drift (default off) plumbed through previewDrift / verifyAppliedState / printDriftPreview / formatFieldChangeLine. apply.go callers pass applyCmdFlags.showSecretsInDrift. absentFieldValue const hoisted so formatFieldValue and formatSecretFieldValue stay byte-identical on the absent path — add/remove vs rotate stays distinguishable. Doc updates: * README.md bullet #1 (Phase 1) now mentions the net-addr walker; bullet #2 (Phase 2A) mentions redaction + --show-secrets-in-drift + the per-node prefix; bullet #4 (Phase 2C) says 'the configured reconcile window' instead of hardcoding 90s. * docs/manual-test-plan.md gains sections C5/C6/C7 (redact, flag opt-out, walker), D3 (per-node prefix), E3 (reconcile window with help-text + 0s rejection), M6 (false-positive guard for redaction), M7 (walker boundary cases). Forward- looking 'do X, expect Y' shape with explicit regression anchors. * docs/apply-safety-gates-test-plan.md gets matching rows in Phase 1 (net-addr walker), Phase 2A (redaction + per-node prefix), and Phase 2C (reconcile window) tables. Test coverage: * Walker: per-kind table tests with the real schema shape, plus TestWalkNetAddrFindings_RealSchema_StaticHostConfig and ..._NetworkRuleConfig as schema-anchor regression pins so a future drift back to non-existent fields fails these tests rather than passing trivially. * Reconcile window: validator boundary cases, flag default one-(default)-clause pin, fail-fast ordering pin via a sentinel originalRunE, hint-text no-hardcoded-90s pin, a README-no-hardcoded-90s pin so README and code stay in sync. * Per-node prefix: maintenance message on both emission sites, empty-nodeID bare-line guard, single-vs-multi GlobalArgs.Nodes fallback pair. * Redaction: exact-match, Wireguard paths, slice-rotation no-leak (CA-list and peers shapes), bracket-normalisation, false-prefix guard, non-secret control, opt-in flag, non-string redaction with length-signal, absent-side add-vs-rotate distinction. Closes #189, closes #190, closes #191, closes #192. Assisted-By: Claude <noreply@anthropic.com> Signed-off-by: Aleksei Sviridkin <f@lex.la>
Four operator-UX polish changes on the safety surfaces introduced by #173. * #191 — Per-node prefix on the maintenance-connection warning. Multi-node insecure apply used to print identical bare 'talm: drift verification unavailable on maintenance connection' lines for every node with no per-node correlation. Both the previewDrift and verifyAppliedState emission sites now wrap through nodePrefix(nodeID); single-node case (empty nodeID) keeps the bare line so the common UX is unchanged. cosiPreflightContext got a fallback path to GlobalArgs.Nodes[0] when no outgoing-context metadata is attached — the maintenance flow (openClientPerNodeMaintenance) pins the node in GlobalArgs but does not attach context metadata, so without the fallback the per-node prefix collapsed to empty even when --nodes was set explicitly. Fallback is single-element-only; multi-element GlobalArgs.Nodes returns empty so a broken caller surfaces as 'no prefix' rather than 'wrong node prefix'. * #190 — Configurable post-upgrade reconcile window. The hardcoded 90s wait after talosctl upgrade returns is now --post-upgrade-reconcile-window. Default defaultPostUpgradeReconcileWindow = 90 * time.Second preserves byte-identical back-compat. validatePostUpgradeReconcileWindow rejects non-positive values at the TOP of wrapUpgradeCommand RunE — fail-fast BEFORE any talosctl RPC fires, so an operator's '=0s' typo cannot land a partial upgrade. The version-mismatch hint copy and error body de-hardcode the '90s' literal to reference 'the configured reconcile window' (defaultPostUpgradeReconcileWindow flows through verifyPostUpgradeVersion). * #192 — Syntactic net-addr walker for three v1alpha1 multidoc kinds. Validates fields the actual Talos schema emits: * StaticHostConfig.name (IP literal; the `name` field on this kind doubles as the IP — there is no separate `address` field). * NetworkRuleConfig.ingress[].subnet and .except (CIDR shapes; no top-level matchSourceAddress[] field exists in Talos's schema). * WireguardConfig.peers[].endpoint (host:port; empty / absent endpoint is a listener-only peer and NOT a finding). Walker runs in parallel with the Ref-based walker via multidocNetAddrHandlers; dispatch-table disjointness with multidocHandlers is pinned by a unit test. Integration tests verify Phase 1 blocks on bad and passes on valid. * #189 — Drift preview redacts secret-bearing field values by default. secretFieldPaths allowlist with bracket-normalisation regex; secret check fires BEFORE bothSlices in formatFieldChangeLine so slice-shaped allowlist entries (cluster.acceptedCAs, machine.acceptedCAs, peers) never leak via formatSliceSetDiff. Length-disclosing redaction sentinel (***redacted (len=N)***) preserves rotation signal without leaking the value. Non-string secret values route through redactValue(fmt.Sprintf("%v", value)) so int / bool rotations still surface as different-length sentinels. Map iteration order non-determinism is disclaimed in the godoc (no map-shaped allowlist entry today). New flag --show-secrets-in-drift (default off) plumbed through previewDrift / verifyAppliedState / printDriftPreview / formatFieldChangeLine. apply.go callers pass applyCmdFlags.showSecretsInDrift. absentFieldValue const hoisted so formatFieldValue and formatSecretFieldValue stay byte-identical on the absent path — add/remove vs rotate stays distinguishable. Doc updates: * README.md bullet #1 (Phase 1) now mentions the net-addr walker; bullet #2 (Phase 2A) mentions redaction + --show-secrets-in-drift + the per-node prefix; bullet #4 (Phase 2C) says 'the configured reconcile window' instead of hardcoding 90s. * docs/manual-test-plan.md gains sections C5/C6/C7 (redact, flag opt-out, walker), D3 (per-node prefix), E3 (reconcile window with help-text + 0s rejection), M6 (false-positive guard for redaction), M7 (walker boundary cases). Forward- looking 'do X, expect Y' shape with explicit regression anchors. * docs/apply-safety-gates-test-plan.md gets matching rows in Phase 1 (net-addr walker), Phase 2A (redaction + per-node prefix), and Phase 2C (reconcile window) tables. Test coverage: * Walker: per-kind table tests with the real schema shape, plus TestWalkNetAddrFindings_RealSchema_StaticHostConfig and ..._NetworkRuleConfig as schema-anchor regression pins so a future drift back to non-existent fields fails these tests rather than passing trivially. * Reconcile window: validator boundary cases, flag default one-(default)-clause pin, fail-fast ordering pin via a sentinel originalRunE, hint-text no-hardcoded-90s pin, a README-no-hardcoded-90s pin so README and code stay in sync. * Per-node prefix: maintenance message on both emission sites, empty-nodeID bare-line guard, single-vs-multi GlobalArgs.Nodes fallback pair. * Redaction: exact-match, Wireguard paths, slice-rotation no-leak (CA-list and peers shapes), bracket-normalisation, false-prefix guard, non-secret control, opt-in flag, non-string redaction with length-signal, absent-side add-vs-rotate distinction. Closes #189, closes #190, closes #191, closes #192. Assisted-By: Claude <noreply@anthropic.com> Signed-off-by: Aleksei Sviridkin <f@lex.la>
Four operator-UX polish changes on the safety surfaces introduced by #173. * #191 — Per-node prefix on the maintenance-connection warning. Multi-node insecure apply used to print identical bare 'talm: drift verification unavailable on maintenance connection' lines for every node with no per-node correlation. Both the previewDrift and verifyAppliedState emission sites now wrap through nodePrefix(nodeID); single-node case (empty nodeID) keeps the bare line so the common UX is unchanged. cosiPreflightContext got a fallback path to GlobalArgs.Nodes[0] when no outgoing-context metadata is attached — the maintenance flow (openClientPerNodeMaintenance) pins the node in GlobalArgs but does not attach context metadata, so without the fallback the per-node prefix collapsed to empty even when --nodes was set explicitly. Fallback is single-element-only; multi-element GlobalArgs.Nodes returns empty so a broken caller surfaces as 'no prefix' rather than 'wrong node prefix'. * #190 — Configurable post-upgrade reconcile window. The hardcoded 90s wait after talosctl upgrade returns is now --post-upgrade-reconcile-window. Default defaultPostUpgradeReconcileWindow = 90 * time.Second preserves byte-identical back-compat. validatePostUpgradeReconcileWindow rejects non-positive values at the TOP of wrapUpgradeCommand RunE — fail-fast BEFORE any talosctl RPC fires, so an operator's '=0s' typo cannot land a partial upgrade. The version-mismatch hint copy and error body de-hardcode the '90s' literal to reference 'the configured reconcile window' (defaultPostUpgradeReconcileWindow flows through verifyPostUpgradeVersion). * #192 — Syntactic net-addr walker for three v1alpha1 multidoc kinds. Validates fields the actual Talos schema emits: * StaticHostConfig.name (IP literal; the `name` field on this kind doubles as the IP — there is no separate `address` field). * NetworkRuleConfig.ingress[].subnet and .except (CIDR shapes; no top-level matchSourceAddress[] field exists in Talos's schema). * WireguardConfig.peers[].endpoint (host:port; empty / absent endpoint is a listener-only peer and NOT a finding). Walker runs in parallel with the Ref-based walker via multidocNetAddrHandlers; dispatch-table disjointness with multidocHandlers is pinned by a unit test. Integration tests verify Phase 1 blocks on bad and passes on valid. * #189 — Drift preview redacts secret-bearing field values by default. secretFieldPaths allowlist with bracket-normalisation regex; secret check fires BEFORE bothSlices in formatFieldChangeLine so slice-shaped allowlist entries (cluster.acceptedCAs, machine.acceptedCAs, peers) never leak via formatSliceSetDiff. Length-disclosing redaction sentinel (***redacted (len=N)***) preserves rotation signal without leaking the value. Non-string secret values route through redactValue(fmt.Sprintf("%v", value)) so int / bool rotations still surface as different-length sentinels. Map iteration order non-determinism is disclaimed in the godoc (no map-shaped allowlist entry today). New flag --show-secrets-in-drift (default off) plumbed through previewDrift / verifyAppliedState / printDriftPreview / formatFieldChangeLine. apply.go callers pass applyCmdFlags.showSecretsInDrift. absentFieldValue const hoisted so formatFieldValue and formatSecretFieldValue stay byte-identical on the absent path — add/remove vs rotate stays distinguishable. Doc updates: * README.md bullet #1 (Phase 1) now mentions the net-addr walker; bullet #2 (Phase 2A) mentions redaction + --show-secrets-in-drift + the per-node prefix; bullet #4 (Phase 2C) says 'the configured reconcile window' instead of hardcoding 90s. * docs/manual-test-plan.md gains sections C5/C6/C7 (redact, flag opt-out, walker), D3 (per-node prefix), E3 (reconcile window with help-text + 0s rejection), M6 (false-positive guard for redaction), M7 (walker boundary cases). Forward- looking 'do X, expect Y' shape with explicit regression anchors. * docs/apply-safety-gates-test-plan.md gets matching rows in Phase 1 (net-addr walker), Phase 2A (redaction + per-node prefix), and Phase 2C (reconcile window) tables. Test coverage: * Walker: per-kind table tests with the real schema shape, plus TestWalkNetAddrFindings_RealSchema_StaticHostConfig and ..._NetworkRuleConfig as schema-anchor regression pins so a future drift back to non-existent fields fails these tests rather than passing trivially. * Reconcile window: validator boundary cases, flag default one-(default)-clause pin, fail-fast ordering pin via a sentinel originalRunE, hint-text no-hardcoded-90s pin, a README-no-hardcoded-90s pin so README and code stay in sync. * Per-node prefix: maintenance message on both emission sites, empty-nodeID bare-line guard, single-vs-multi GlobalArgs.Nodes fallback pair. * Redaction: exact-match, Wireguard paths, slice-rotation no-leak (CA-list and peers shapes), bracket-normalisation, false-prefix guard, non-secret control, opt-in flag, non-string redaction with length-signal, absent-side add-vs-rotate distinction. Closes #189, closes #190, closes #191, closes #192. Assisted-By: Claude <noreply@anthropic.com> Signed-off-by: Aleksei Sviridkin <f@lex.la>
Summary
Implements the apply-time safety gates umbrella with four gates plus the
talm init --updatenon-tty UX fix surfaced during real-cluster validation:Phase 1 — declared-resource existence (block-by-default,
--skip-resource-validationopt-out): walks the rendered MachineConfig and verifies every declared link / disk reference against the node's live COSI snapshot. v1.12 multi-doc form coversLinkConfig.name,DHCPv4/6Config.name,EthernetConfig.name,BondConfig.links[],BridgeConfig.links[],VLANConfig.parent,Layer2VIPConfig.link,HCloudVIPConfig.link, andUserVolumeConfig.provisioning.diskSelector. v1.11 nested form coversmachine.network.interfaces[].interface,machine.install.disk,machine.install.diskSelector, andmachine.disks[].device. Virtual-creator documents (BondConfig.name,VLANConfig.name, etc.) are intentionally skipped — their.namedescribes a new resource being created, not an existing-link reference. Findings cite the source JSONPath so operators can locate the offending value directly. Hint candidate lists are capped at 10 entries with a... and N moresuffix.Phase 2A — pre-apply drift preview (
--skip-drift-previewopt-out): reads on-node MachineConfig via COSI and diffs against what's about to be sent, rendering+/-/~/=per(kind, name)pair. Informational — never blocks. Runs on--dry-runtoo. Slice values surface as set-diff buckets (removed [127.0.0.1]for certSANs duplicate-cleanup) rather than full-slice dumps; map / nested values render in YAML flow style. Multi-node output is per-node-distinguished via(node X.Y.Z.W)in the header so multi-node stderr is disambiguated.Phase 2B — post-apply state verification (
--skip-post-apply-verifyopt-out, default off until the Talos-mutated-field allowlist lands — see #172): re-reads on-node MachineConfig afterApplyConfigurationsuccess and structurally compares against the bytes sent. Blocks on divergence. Auto-skipped on--mode=staged/--mode=try/--mode=reboot/--mode=auto/--dry-run— each for a documented reason. The AUTO skip in particular is non-obvious: Talos's apply server promotes AUTO to REBOOT internally when the change requires it, so the verify would race the reboot.Phase 2C — post-upgrade version verify (closes #175,
--skip-post-upgrade-verifyopt-out, default on): aftertalosctl upgradereports success, waits the reconcile window, readsruntime.Versionvia COSI, and compares the running version's(Major, Minor)contract against the contract parsed from the target image tag. Catches the silent A/B rollback case where Talos rolls back to the previous partition (cross-vendor image, missing extensions, failed boot readiness check) yet the upgrade RPC has already acked. Cross-minor mismatch surfaces as a hint-bearing blocker with a two-hypothesis hint (rollback OR slow boot beyond the reconcile window). Multi-node verify is collect-then-block viaerrors.Join. Best-effort surrender on digest-pinned images and unparseable tags. Skipped on--insecure(auth-only COSI path) and--stage(new partition not yet booted).Both Phase 2 gates that run on the auth path degrade gracefully on the insecure / maintenance connection: MachineConfig is
meta.Sensitive, the reader returnsok=falsedeterministically (the maintenance-vs-auth path is statically known from--insecure), and the gate prints one explanatory line without blocking.talm init --updatenon-tty handling (closes #174): the preset-template overwrite prompt used to wedge onEOFwhen stdin wasn't a tty, silently leaving the project on a stale preset. Now resolves to one of three policies via a pureresolveOverwritePolicy(force, isTTY)predicate:--forceaccepts all overwrites with a per-file log line; tty + no--forcekeeps the historical interactive prompt; non-tty + no--forceerrors with a hint pointing at--force.stdinIsTTYusesterm.IsTerminalinstead of the naiveModeCharDevicecheck that falsely accepted/dev/nullas a tty.--imageruns throughvalidateImageRefShapebefore writing, rejecting malformed values (::malformed,no-slash:tag) up front instead of leaving a brokenvalues.yaml.Tests
TDD red-first for every contract-level predicate. ~3,400 LOC of test code across
pkg/applycheck(walker, validator, differ, selector contract tests) andpkg/commands(preflight hook integration via function-type fakes, mode-gating matrices, multi-node collect-then-block, wall-clock regression pin for the post-upgrade reconcile-window placement). Lint clean on host andGOOS=windows,go vetclean, race detector clean.TestNoWorkflowLeakageInRepoSourceis extended withdev17/lexfreito ban private operator identifiers from committed source.Real-cluster validation
Exercised against a 3-node OCI Talos v1.12.6 reference stand. Each gate's behavior was verified end-to-end through the auth and insecure paths; Phase 2C was confirmed against an intentional cross-vendor upgrade (cozystack-bundled → vanilla siderolabs installer) that produced the documented two-hypothesis blocker. Test plan reference at
docs/apply-safety-gates-test-plan.mdfor future re-validation.Closes
Closes #172. Closes #174. Closes #175.