Skip to content

feat(apply): apply-time safety gates — declared-resource existence + drift preview/verify#173

Merged
Aleksei Sviridkin (lexfrei) merged 1 commit into
mainfrom
feat/apply-safety-gates
May 12, 2026
Merged

feat(apply): apply-time safety gates — declared-resource existence + drift preview/verify#173
Aleksei Sviridkin (lexfrei) merged 1 commit into
mainfrom
feat/apply-safety-gates

Conversation

@lexfrei

@lexfrei Aleksei Sviridkin (lexfrei) commented May 11, 2026

Copy link
Copy Markdown
Contributor

Summary

Implements the apply-time safety gates umbrella with four gates plus the talm init --update non-tty UX fix surfaced during real-cluster validation:

Phase 1 — declared-resource existence (block-by-default, --skip-resource-validation opt-out): walks the rendered MachineConfig and verifies every declared link / disk reference against the node's live COSI snapshot. v1.12 multi-doc form covers LinkConfig.name, DHCPv4/6Config.name, EthernetConfig.name, BondConfig.links[], BridgeConfig.links[], VLANConfig.parent, Layer2VIPConfig.link, HCloudVIPConfig.link, and UserVolumeConfig.provisioning.diskSelector. v1.11 nested form covers machine.network.interfaces[].interface, machine.install.disk, machine.install.diskSelector, and machine.disks[].device. Virtual-creator documents (BondConfig.name, VLANConfig.name, etc.) are intentionally skipped — their .name describes a new resource being created, not an existing-link reference. Findings cite the source JSONPath so operators can locate the offending value directly. Hint candidate lists are capped at 10 entries with a ... and N more suffix.

Phase 2A — pre-apply drift preview (--skip-drift-preview opt-out): reads on-node MachineConfig via COSI and diffs against what's about to be sent, rendering +/-/~/= per (kind, name) pair. Informational — never blocks. Runs on --dry-run too. Slice values surface as set-diff buckets (removed [127.0.0.1] for certSANs duplicate-cleanup) rather than full-slice dumps; map / nested values render in YAML flow style. Multi-node output is per-node-distinguished via (node X.Y.Z.W) in the header so multi-node stderr is disambiguated.

Phase 2B — post-apply state verification (--skip-post-apply-verify opt-out, default off until the Talos-mutated-field allowlist lands — see #172): re-reads on-node MachineConfig after ApplyConfiguration success and structurally compares against the bytes sent. Blocks on divergence. Auto-skipped on --mode=staged / --mode=try / --mode=reboot / --mode=auto / --dry-run — each for a documented reason. The AUTO skip in particular is non-obvious: Talos's apply server promotes AUTO to REBOOT internally when the change requires it, so the verify would race the reboot.

Phase 2C — post-upgrade version verify (closes #175, --skip-post-upgrade-verify opt-out, default on): after talosctl upgrade reports success, waits the reconcile window, reads runtime.Version via COSI, and compares the running version's (Major, Minor) contract against the contract parsed from the target image tag. Catches the silent A/B rollback case where Talos rolls back to the previous partition (cross-vendor image, missing extensions, failed boot readiness check) yet the upgrade RPC has already acked. Cross-minor mismatch surfaces as a hint-bearing blocker with a two-hypothesis hint (rollback OR slow boot beyond the reconcile window). Multi-node verify is collect-then-block via errors.Join. Best-effort surrender on digest-pinned images and unparseable tags. Skipped on --insecure (auth-only COSI path) and --stage (new partition not yet booted).

Both Phase 2 gates that run on the auth path degrade gracefully on the insecure / maintenance connection: MachineConfig is meta.Sensitive, the reader returns ok=false deterministically (the maintenance-vs-auth path is statically known from --insecure), and the gate prints one explanatory line without blocking.

talm init --update non-tty handling (closes #174): the preset-template overwrite prompt used to wedge on EOF when stdin wasn't a tty, silently leaving the project on a stale preset. Now resolves to one of three policies via a pure resolveOverwritePolicy(force, isTTY) predicate: --force accepts all overwrites with a per-file log line; tty + no --force keeps the historical interactive prompt; non-tty + no --force errors with a hint pointing at --force. stdinIsTTY uses term.IsTerminal instead of the naive ModeCharDevice check that falsely accepted /dev/null as a tty. --image runs through validateImageRefShape before writing, rejecting malformed values (::malformed, no-slash:tag) up front instead of leaving a broken values.yaml.

Tests

TDD red-first for every contract-level predicate. ~3,400 LOC of test code across pkg/applycheck (walker, validator, differ, selector contract tests) and pkg/commands (preflight hook integration via function-type fakes, mode-gating matrices, multi-node collect-then-block, wall-clock regression pin for the post-upgrade reconcile-window placement). Lint clean on host and GOOS=windows, go vet clean, race detector clean. TestNoWorkflowLeakageInRepoSource is extended with dev17 / lexfrei to ban private operator identifiers from committed source.

Real-cluster validation

Exercised against a 3-node OCI Talos v1.12.6 reference stand. Each gate's behavior was verified end-to-end through the auth and insecure paths; Phase 2C was confirmed against an intentional cross-vendor upgrade (cozystack-bundled → vanilla siderolabs installer) that produced the documented two-hypothesis blocker. Test plan reference at docs/apply-safety-gates-test-plan.md for future re-validation.

Closes

Closes #172. Closes #174. Closes #175.

@coderabbitai

coderabbitai Bot commented May 11, 2026

Copy link
Copy Markdown
📝 Walkthrough

Walkthrough

This PR implements apply-time and upgrade-time safety gates for the talm CLI, including host-resource existence validation, configuration drift preview, post-apply state verification, and post-upgrade version verification. It adds companion CLI skip flags and integrates reference extraction, disk selector matching, and multi-document YAML diffing infrastructure to support these phases. Additionally, it fixes the non-TTY issue with talm init --update by introducing an overwrite policy abstraction and TTY detection.

Changes

Apply-time safety gates and init improvements

Layer / File(s) Summary
Reference extraction and host-snapshot types
pkg/applycheck/doc.go, pkg/applycheck/refs.go, pkg/applycheck/refs_test.go
RefKind, DiskSelector, and Ref types to model host-resource references extracted from v1.11 nested and v1.12 multi-document YAML. WalkRefs decodes documents and emits refs for network links, disk literals, and disk selectors with JSONPath-like source paths.
Disk selector matching and validation
pkg/applycheck/selector.go, pkg/applycheck/validate.go, pkg/applycheck/validate_test.go
Disk selector predicates (glob/exact/type/size) with exclusion rules (readonly, CD-ROM, virtual), deterministic hint formatting with "… and N more" truncation, and ValidateRefs entry point that checks ref resolution against host snapshots and produces blocker/warning Finding results.
Multi-document YAML diff engine
pkg/applycheck/diff.go, pkg/applycheck/diff_test.go
Diff parses current/desired snapshots by DocID, emits per-document Change entries (add/remove/update/equal), and performs leaf-level FieldChange diffing that distinguishes absent fields from present-but-null. FilterChanged and FormatChange utilities support downstream use.
Phase 1: Host-resource validation
pkg/commands/preflight_apply_safety.go (lines 1–141), pkg/commands/preflight_apply_safety_test.go (lines 72–165)
preflightValidateResources walks rendered config refs, reads COSI host snapshot (links and disks), validates all refs resolve, and returns blocker errors with truncated hint lists when validation fails. Gracefully handles reader unavailability on insecure paths.
Phase 2A/2B: Drift preview and post-apply verification
pkg/commands/preflight_apply_safety.go (lines 159–618), pkg/commands/preflight_apply_safety_test.go (lines 166–871)
Phase 2A previews drift via previewDrift with on-node MachineConfig diff, multiset-aware slice set-diffing, and YAML flow-style single-line field rendering. Phase 2B re-reads post-apply state via verifyAppliedState with reconcile window and blocks on structural divergence. Both support per-node headers and graceful degradation on maintenance paths. Tests validate all scenarios including scheduling predicates and rendering edge cases.
Phase 2C: Post-upgrade version verification
pkg/commands/preflight_upgrade_verify.go, pkg/commands/preflight_upgrade_verify_test.go
parseTargetVersion extracts version literals from image tags, verifyPostUpgradeVersion reads running version and enforces major/minor match, distinguishes connection errors from slow-boot scenarios in hints, and skips silently on digest pins or missing tags.
Integration into apply and preflight commands
pkg/commands/apply.go, pkg/commands/apply_test.go, pkg/commands/preflight.go, pkg/commands/preflight_test.go
Threads resolved node ID through cosiPreflightContext returning (ctx, nodeID, error), wires Phase 1 and Phase 2 gates per-node in both template-render and direct-patch modes, adds --skip-resource-validation, --skip-drift-preview, --skip-post-apply-verify flags. Updates versionReader contract to (version, ok, err) 3-tuple with test stub support.
Integration into upgrade command
pkg/commands/upgrade_handler.go, pkg/commands/upgrade_handler_test.go
Captures pre-upgrade --image, --insecure, --stage, implements 90s post-upgrade reconcile window, resolves target nodes from CLI or talosconfig, runs per-node version verification collecting all failures, and gates via shouldRunPostUpgradeVerify predicate.
Init --update overwrite policy and TTY detection
pkg/commands/init.go, pkg/commands/init_overwrite_test.go
Introduces overwritePolicy enum (force/ask/non-interactive) with TTY detection via golang.org/x/term. Refactors askUserOverwrite to auto-overwrite under --force, block in non-interactive mode with clear --force hint, or prompt on TTY. Centralizes "Created …" reporting via createReportedFile. Tests validate all policy branches and TTY/non-TTY behavior.
Init image reference validation
pkg/commands/init.go, pkg/commands/init_test.go
Tightens imageLineRe to require space after colon in YAML image: keys. Adds validateImageRefShape to enforce OCI structure (requires / and :tag or @sha256:/sha512: digest), integrated into validateImageOverride before preset scanning. Tests validate malformed rejection and error reporting.
Documentation, test infrastructure, and CI
README.md, docs/apply-safety-gates-test-plan.md, docs/manual-test-plan.md, go.mod, .github/workflows/pr.yml, pkg/engine/engine_test.go
README section describing all four safety gates, their defaults, and per-phase skip flags. Comprehensive safety gates reference test plan (Phase 1–2C, real-Talos validation steps, limitations). Manual end-to-end test matrix for talm workflows. Promotes go-humanize and golang.org/x/term to direct dependencies. Increases golangci-lint timeout from 5m to 10m. Extends banned-phrase list for workflow leakage checks.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐰 A rabbit's ode to safety gates:
Four phases guard thy config's fate,
With refs extracted, diffs displayed,
And TTY prompts that won't degrade—
Now preset updates reach the gate,
And upgrades verify their state! 🎉

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/apply-safety-gates

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces the applycheck package to implement safety gates for Talos MachineConfig applications, including host-resource reference validation and structural diffing. These checks are integrated into the apply command with new flags to control resource validation, drift previews, and post-apply verification. A review comment identifies an improvement opportunity in error handling within cosiMachineConfigReader to prevent transient errors from being incorrectly reported as maintenance connection issues.

Comment on lines +304 to +313
if err != nil {
// PermissionDenied on the insecure path is the documented
// Sensitive-resource degrade signal. Distinguishing transient
// failures from auth-side denial would require parsing the
// gRPC status; for the pre-merge cut we treat any read
// failure as "could not read" and let the caller decide
// whether to surface that as a blocker (Phase 2B never does;
// Phase 2A never does either).
return nil, false, nil
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The current implementation swallows the error from safe.StateGet. This can lead to misleading output in previewDrift and verifyAppliedState, where a transient network or API error is reported as a maintenance connection issue. It's better to return the error so the caller can provide accurate feedback.

Suggested change
if err != nil {
// PermissionDenied on the insecure path is the documented
// Sensitive-resource degrade signal. Distinguishing transient
// failures from auth-side denial would require parsing the
// gRPC status; for the pre-merge cut we treat any read
// failure as "could not read" and let the caller decide
// whether to surface that as a blocker (Phase 2B never does;
// Phase 2A never does either).
return nil, false, nil
}
if err != nil {
// PermissionDenied on the insecure path is the documented
// Sensitive-resource degrade signal. Distinguishing transient
// failures from auth-side denial would require parsing the
// gRPC status; for the pre-merge cut we treat any read
// failure as "could not read" and let the caller decide
// whether to surface that as a blocker (Phase 2B never does;
// Phase 2A never does either).
return nil, false, err
}

Aleksei Sviridkin (lexfrei) added a commit that referenced this pull request May 11, 2026
Capture the matrix of cases the gates were validated against during
PR #173 development: Phase 1 link / disk / selector inputs, Phase 2A
diff classification, Phase 2B mode gating, real-Talos validation
steps against the dev17 reference cluster. Documents the known
limitations (upgrade has no gates, insecure path can't render with
discovery, --init --update non-tty UX gap).

Future-self reference for re-validation after touching applycheck
or the preflight hooks.

Refs: #172
Signed-off-by: Aleksei Sviridkin <f@lex.la>
Aleksei Sviridkin (lexfrei) added a commit that referenced this pull request May 11, 2026
Adds docs/manual-test-plan.md, a checklist for exercising talm
against a real Talos cluster. Complementary to the narrow
apply-safety-gates plan: it walks the whole CLI surface (init,
template, apply, upgrade, rotate-ca, meta, reset, completion,
read-only diagnostics) with expected outcomes and the failure
modes to watch for.

Each section captures the exact commands run during the dev17
adversarial sweep that produced the eight bugs PR #173 fixes,
plus the destructive sanity-check block to run after reset /
rotate-ca / mode=reboot to confirm cluster health.

Extends naturally — new sections drop in alphabetically (K, L, ...).

Also adds nit.md to .gitignore so the in-flight notes file
doesn't get accidentally committed.

Refs: #172
Signed-off-by: Aleksei Sviridkin <f@lex.la>
Aleksei Sviridkin (lexfrei) added a commit that referenced this pull request May 12, 2026
…kip + linksDisksReader three-valued + md reflow

Blockers from branch-review pass 2 (#173):

1. Phase 2C ran WithClient even on --insecure upgrade. The
   maintenance / pre-auth connection cannot reach the auth-only
   COSI path; pre-fix the gate either silently surrendered with a
   'version unreadable' line or, worse, connected to an unrelated
   node from the talosconfig context and verified its state.
   Extract shouldRunPostUpgradeVerify(insecure, staged, skip)
   pure predicate; capture --insecure flag BEFORE original RunE
   (alongside --image and --stage). Mirrors
   cosiMachineConfigReader's insecure-path branch in apply.

2. Phase 2C ran after --stage upgrades — guaranteed false positive
   blocker since the new partition isn't activated until the next
   reboot and runtime.Version still reports the OLD version. Same
   predicate now skips on --stage, mirroring
   shouldRunPostApplyVerify's STAGED case for apply.

3. docs/manual-test-plan.md K2-pre block claimed Phase 2C 'until
   it lands', contradicting the new K1-pre section right above and
   the README addition. Reworked K2-pre as the manual fallback for
   --skip-post-upgrade-verify / --insecure flows that the gate
   skips by design. K1-pre's expected output is synced with the
   actual emit (two-hypothesis hint, waiting line, skip list).

4. cosiLinksDisksReader had a two-valued (snapshot, ok) signature
   that conflated transient COSI errors with auth-disallowed.
   Mirror machineConfigReader's three-valued
   (snapshot, ok, err) shape:

     - err != nil  -> transient (apid timeout, network blip);
       surface underlying cause, the operator's config isn't
       wrong.
     - !ok && err==nil -> auth-disallowed / resource unreachable
       by design; suggest --skip-resource-validation.
     - ok           -> success.

   preflightValidateResources splits the two branches into
   distinct error messages (no more misleading 'config is wrong'
   on a 2s COSI timeout). New regression test
   TestPreflightValidateResources_TransientErr_WrapsUnderlying
   pins the contract.

5. Phase 2C silent no-op when both --nodes and talosconfig
   context's Nodes are empty. Now prints an explanatory line so
   the operator sees the gate they opted into didn't actually
   run.

Recommendations addressed:

- Phase 2C 90s wait now prints a 'waiting...' line up front so
  the operator's terminal isn't a mystery hang.
- targetImage == '' path prints a skip notice instead of silent
  no-op.
- transportNVMe const split into transportNVMe (COSI Disk.Transport
  value) + selectorTypeNVMe (v1alpha1 InstallDiskSelector.type
  enum). They're the same string today; documenting the contract
  separation prevents a silent future regression.
- Phase 2C error message reworded to lead with both hypotheses
  ('auto-rolled back OR still booting'), matching the two-
  hypothesis hint constant.

Documentation reflow (separate, no behaviour change): user pushed
back on hardwrapped prose in human-rendered .md and PR/issue
bodies. Reflowed docs/manual-test-plan.md,
docs/apply-safety-gates-test-plan.md, nit.md, GitHub issue #178,
and PR #173 body to one-line-per-paragraph per CLAUDE.md's prose
rule. Tables, code blocks, and bullet structure preserved
verbatim.

Recommendation NOT addressed in this pass: the
parentDir-shadowing-the-package-constant nit. Pre-existing across
~6 sites; folding it into this PR adds noise. Worth a follow-up
issue (kind/cleanup, area/commands, good first issue).

Refs: #172, #175
Signed-off-by: Aleksei Sviridkin <f@lex.la>
Adds four apply-time safety gates to talm (umbrella #172) plus the
talm init --update non-tty UX fix from #174. The Phase 2C gate also
closes #175 by detecting silent Talos auto-rollback after a failed
cross-vendor upgrade.

# Phase 1 — declared resource existence

Pre-apply walker that scans the rendered MachineConfig for declared
links and disks, then asserts each one resolves against the node's
live COSI snapshot. Catches typos before they hit talosctl. Walker
coverage:

  v1.12 multi-doc form (validates references to existing links):
    - LinkConfig.name
    - DHCPv4Config.name, DHCPv6Config.name, EthernetConfig.name
    - BondConfig.links[], BridgeConfig.links[]
    - VLANConfig.parent
    - Layer2VIPConfig.link, HCloudVIPConfig.link

  v1.12 multi-doc form (validates disk references):
    - UserVolumeConfig.provisioning.diskSelector

  v1.11 nested form:
    - machine.network.interfaces[].interface (link refs)
    - machine.install.disk (literal path)
    - machine.install.diskSelector (size/model/serial/wwid/type
      /busPath filter)
    - machine.disks[].device

Selector semantics mirror Talos's v1alpha1_provider.go disk
resolution: humanize.ParseBytes for size, virtual / readonly /
CDROM exclusion, by-id / by-path stable symlinks accepted. The
type-selector enum (nvme / sd / hdd / ssd) maps to (Transport,
Rotational) per Talos; case handling is intentionally permissive
(predicateType lowercases) — Phase 1 catches declared-resource
mismatches, not Talos's input-format strictness.

Bond / Bridge / VLAN / Wireguard / Dummy / LinkAlias config kinds
are intentionally excluded from name-emission: their .name
describes a NEW resource being created, not a reference to an
existing one. The walker's multidocHandlers dispatch table
distinguishes the two classes explicitly.

Walker is YAML-only — no Talos machinery types in its surface —
so contract tests in pkg/applycheck/refs_test.go run without
standing up a live cluster. Real Talos YAML keys (BridgeConfig.
links not ports, VLANConfig.parent not link) are pinned to test
so a future Talos schema rename surfaces as a red test.

Skip via --skip-resource-validation. Default on.

Hint candidate lists are capped at 10 entries with a "... and N
more" suffix so cozystack hosts with many dm/drbd/loop devices
don't drown the actual finding line.

# Phase 2A — pre-apply drift preview

Reads the node's current MachineConfig via COSI and emits a
+/-/~/= diff against the rendered config that's about to be
applied. Informational — never blocks. Runs on every apply mode
including --dry-run.

Diff is doc-structural, keyed by (Kind, Name) for v1.12 multi-doc
and synthetic MachineConfig{} for v1.11 root. Re-serialization
order is irrelevant (TestDiff_KeyOrderIrrelevant). Slice values
are surfaced as multiset add/remove buckets ("removed [127.0.0.1]"
for certSANs duplicate-cleanup) rather than full-slice dumps;
map / nested values render in YAML flow style ({k: v, k2: v2})
rather than Go's map[k:v k2:v2]. Empty nested maps emit as their
own leaf so an absent / present transition surfaces visibly.

Duplicate (Kind, Name) in a single doc stream is a config defect
(Talos's behaviour on duplicates is unspecified) — parseDocs
surfaces it as an error so the operator sees the real problem
instead of a misleading drift line.

Multi-node output is per-node-distinguished: the header includes
"(node X.Y.Z.W)" when nodeID is non-empty so an operator
scanning stderr from a multi-node apply can tell whose diff is
whose. The auth-template path threads the node through
cosiPreflightContext, which downshifts client.WithNodes (plural,
single-element — needed for template lookup round-tripping) to
client.WithNode (singular — needed because apid's COSI router
rejects the plural metadata key); multi-element ctx is refused
up front.

On insecure / maintenance connection the MachineConfig resource
is Sensitive and unreachable, so the gate degrades gracefully
with a one-line "drift verification unavailable on maintenance
connection" warning.

Skip via --skip-drift-preview. Default on.

# Phase 2B — post-apply state verification

After ApplyConfiguration returns success, re-reads the on-node
MachineConfig and structurally compares against the bytes that
were sent. Divergence blocks with a per-document diff, catching
silent doc drops (Talos parser ignored an unknown field) and
controller reverts.

Mode skips:
  - REBOOT / STAGED / TRY — verify cannot reach a stable
    ActiveID; pinned by shouldRunPostApplyVerify matrix.
  - AUTO — Talos's apply-server promotes AUTO to REBOOT
    internally when the change requires it, so the verify would
    race the reboot. Conservative skip; operators wanting verify
    on no-reboot AUTO applies pass --mode=no-reboot explicitly.
  - --dry-run — no apply happened.

Reader err != nil is a hard blocker (the gate exists to catch
silent rollbacks; swallowing a read error would defeat its
purpose). ok=false err=nil on the auth path degrades to a
maintenance-connection warning.

Skip via --skip-post-apply-verify. Default OFF until the Talos-
mutated-field allowlist lands (#172) — Talos mutates cert hashes
and timestamps post-apply that surface as false-positive
divergence today.

# Phase 2C — post-upgrade version verify

After talosctl upgrade returns success, waits 90s for the node
to finish booting then reads runtime.Version via COSI and
compares the running version's (Major, Minor) contract against
the contract parsed from the target image tag. Cross-minor
mismatch surfaces as a hint-bearing blocker with a two-
hypothesis hint (silent A/B rollback after the new partition
failed its boot readiness check, OR the node is slower than the
reconcile window).

Closes #175: cross-vendor upgrade silent no-op. The upgrade RPC
acks success while Talos has actually rolled back to the
previous partition (cross-vendor image, missing extensions,
failed boot readiness check). The gate catches this by reading
runtime.Version on the auth path post-boot.

Target image is captured BEFORE talosctl's RunE — talosctl's
upgrade flow can overwrite --image with the node's running
install.image (no-op upgrade), which would mask the version
mismatch the gate exists to catch.

Mode skips:
  - --insecure — auth-only COSI path is unreachable; hard
    early-return in shouldRunPostUpgradeVerify.
  - --stage — new partition not yet activated until reboot;
    runtime.Version would always report the old version.

Best-effort surrender on digest-pinned images and unparseable
tags (the gate can't extract a version literal). Reader err is
the rollback signal — a node that hung mid-boot looks like
"connection refused" from COSI; surfaces as a hint-bearing
blocker (TestVerifyPostUpgradeVersion_ReaderConnectionRefused_
NotSilent). versionReader signature is three-valued (string,
bool, error) to distinguish real read failures from by-design
unreachable cases.

Multi-node verify is collect-then-block: every node's verdict
is captured (errors.Join) rather than short-circuiting on the
first failure. talosctl upgrade has already executed on every
node by the time this loop runs, so stopping at the first
failure would hide partial rollbacks.

The 90s reconcile wait is deferred until after node resolution
so an operator with an empty --nodes list doesn't sit through
90s of "waiting for the node to finish booting..." just to be
told there was no target node.

Skip via --skip-post-upgrade-verify. Default on.

# init --update non-tty handling (#174)

Closes #174. Previously, talm init --update under a non-tty
stdin (CI, scripted) hung on an interactive overwrite prompt
that EOF'd into a cryptic error. Now:

  - resolveOverwritePolicy(force, isTTY) is a pure predicate
    returning Ask / Force / NonInteractive. Non-tty without
    --force emits a hint-bearing error pointing the operator
    at --force.
  - stdinIsTTY uses term.IsTerminal instead of the naive
    ModeCharDevice check that falsely accepted /dev/null as a
    tty (and produced the original EOF wedge).
  - init --image runs validateImageRefShape before writing,
    rejecting malformed --image values (e.g. "::malformed",
    "no-slash:tag") up front instead of leaving a broken
    values.yaml that fails later inside configloader.

# Implementation notes

Engine package is unchanged. All gate logic lives in
pkg/applycheck (walker, validator, differ, selector) and
pkg/commands/preflight_apply_safety.go (Phase 1/2A/2B hooks +
COSI readers) plus pkg/commands/preflight_upgrade_verify.go
(Phase 2C).

The three readers (linksDisksReader, machineConfigReader,
versionReader) share the (snapshot, ok, err) three-valued
contract for test-fake symmetry, though the policy on each
output combination differs per gate.

cosiLinksDisksReader gives each ListAll call its own
context.WithTimeout budget — a shared budget would leak the
first read's elapsed time into the second's deadline and
produce false transient-timeout blockers on slow nodes.

multisetDiff's equality key is fmt.Sprintf("%v", item) wrapped
behind canonicalKey() — fmt sorts map keys for stable output
(documented in the fmt package), and the wrapper isolates the
contract for a future swap. Pinned by a 100-iteration
stability test.

# Documentation

  - README's "Apply-time safety gates" section describes all
    four gates with their skip-flag defaults.
  - docs/apply-safety-gates-test-plan.md — narrow QA reference
    for the gates: matrix of mode × skip-flag × insecure ×
    --dry-run for every gate, real-Talos validation steps.
  - docs/manual-test-plan.md — broader end-to-end QA matrix
    covering the whole talm CLI surface.
  - golangci-lint timeout bumped to 10m in pr.yml to
    accommodate the additional package's analysis time.

Test surface includes red+green pairs for every contract-level
predicate (TDD per project rule) plus integration tests for the
preflight hooks via function-type fakes. Walker, differ,
validator, selector each have their own contract test file.
TestNoWorkflowLeakageInRepoSource is extended to ban private
operator identifiers from committed source.

Closes #172
Closes #174
Closes #175

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
@lexfrei Aleksei Sviridkin (lexfrei) marked this pull request as ready for review May 12, 2026 08:07

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/commands/init.go`:
- Around line 1089-1092: The current error wrapping around
askUserOverwrite(filePath, policy) replaces the original error (e.g.,
overwritePolicyNonInteractive) with "failed to read user input", which is
misleading for non-interactive flows; update the error handling in the caller of
askUserOverwrite to preserve the original error context: detect or allow the
sentinel error returned by askUserOverwrite (such as
overwritePolicyNonInteractive) to be returned directly (or wrapped with a
message that includes the original error), and only use "failed to read user
input" for genuine stdin read failures; reference askUserOverwrite,
overwritePolicyNonInteractive, filePath and policy when making this change so
the non-tty path propagates its true error.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 6f332d6f-b787-4a4f-a1c5-c4ba2e2968ba

📥 Commits

Reviewing files that changed from the base of the PR and between ed3b820 and 7868946.

📒 Files selected for processing (27)
  • .github/workflows/pr.yml
  • README.md
  • docs/apply-safety-gates-test-plan.md
  • docs/manual-test-plan.md
  • go.mod
  • pkg/applycheck/diff.go
  • pkg/applycheck/diff_test.go
  • pkg/applycheck/doc.go
  • pkg/applycheck/refs.go
  • pkg/applycheck/refs_test.go
  • pkg/applycheck/selector.go
  • pkg/applycheck/validate.go
  • pkg/applycheck/validate_test.go
  • pkg/commands/apply.go
  • pkg/commands/apply_test.go
  • pkg/commands/init.go
  • pkg/commands/init_overwrite_test.go
  • pkg/commands/init_test.go
  • pkg/commands/preflight.go
  • pkg/commands/preflight_apply_safety.go
  • pkg/commands/preflight_apply_safety_test.go
  • pkg/commands/preflight_test.go
  • pkg/commands/preflight_upgrade_verify.go
  • pkg/commands/preflight_upgrade_verify_test.go
  • pkg/commands/upgrade_handler.go
  • pkg/commands/upgrade_handler_test.go
  • pkg/engine/engine_test.go

Comment thread pkg/commands/init.go
Comment on lines +1089 to 1092
overwrite, err := askUserOverwrite(filePath, policy)
if err != nil {
return errors.Wrap(err, "failed to read user input")
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Preserve policy error context instead of forcing a “read input” message.

When overwritePolicyNonInteractive is hit, no stdin read occurs, but Line 1091 rewrites that path to "failed to read user input". This makes non-tty failures misleading.

Suggested fix
 	overwrite, err := askUserOverwrite(filePath, policy)
 	if err != nil {
-		return errors.Wrap(err, "failed to read user input")
+		return err
 	}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
overwrite, err := askUserOverwrite(filePath, policy)
if err != nil {
return errors.Wrap(err, "failed to read user input")
}
overwrite, err := askUserOverwrite(filePath, policy)
if err != nil {
return err
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/commands/init.go` around lines 1089 - 1092, The current error wrapping
around askUserOverwrite(filePath, policy) replaces the original error (e.g.,
overwritePolicyNonInteractive) with "failed to read user input", which is
misleading for non-interactive flows; update the error handling in the caller of
askUserOverwrite to preserve the original error context: detect or allow the
sentinel error returned by askUserOverwrite (such as
overwritePolicyNonInteractive) to be returned directly (or wrapped with a
message that includes the original error), and only use "failed to read user
input" for genuine stdin read failures; reference askUserOverwrite,
overwritePolicyNonInteractive, filePath and policy when making this change so
the non-tty path propagates its true error.

@lexfrei Aleksei Sviridkin (lexfrei) merged commit a00581e into main May 12, 2026
8 checks passed
Aleksei Sviridkin (lexfrei) added a commit that referenced this pull request May 13, 2026
Four operator-UX polish changes on the safety surfaces introduced
by #173.

* #191 — Per-node prefix on the maintenance-connection warning.
  Multi-node insecure apply used to print identical bare
  'talm: drift verification unavailable on maintenance connection'
  lines for every node with no per-node correlation. Both the
  previewDrift and verifyAppliedState emission sites now wrap
  through nodePrefix(nodeID); single-node case (empty nodeID)
  keeps the bare line so the common UX is unchanged.

  cosiPreflightContext got a fallback path to GlobalArgs.Nodes[0]
  when no outgoing-context metadata is attached — the maintenance
  flow (openClientPerNodeMaintenance) pins the node in GlobalArgs
  but does not attach context metadata, so without the fallback
  the per-node prefix collapsed to empty even when --nodes was
  set explicitly. Fallback is single-element-only; multi-element
  GlobalArgs.Nodes returns empty so a broken caller surfaces as
  'no prefix' rather than 'wrong node prefix'.

* #190 — Configurable post-upgrade reconcile window.
  The hardcoded 90s wait after talosctl upgrade returns is now
  --post-upgrade-reconcile-window. Default
  defaultPostUpgradeReconcileWindow = 90 * time.Second preserves
  byte-identical back-compat. validatePostUpgradeReconcileWindow
  rejects non-positive values at the TOP of wrapUpgradeCommand
  RunE — fail-fast BEFORE any talosctl RPC fires, so an
  operator's '=0s' typo cannot land a partial upgrade. The
  version-mismatch hint copy and error body de-hardcode the '90s'
  literal to reference 'the configured reconcile window'
  (defaultPostUpgradeReconcileWindow flows through
  verifyPostUpgradeVersion).

* #192 — Syntactic net-addr walker for three v1alpha1 multidoc
  kinds. Validates fields the actual Talos schema emits:
    * StaticHostConfig.name (IP literal; the `name` field on
      this kind doubles as the IP — there is no separate
      `address` field).
    * NetworkRuleConfig.ingress[].subnet and .except (CIDR
      shapes; no top-level matchSourceAddress[] field exists
      in Talos's schema).
    * WireguardConfig.peers[].endpoint (host:port; empty /
      absent endpoint is a listener-only peer and NOT a
      finding).
  Walker runs in parallel with the Ref-based walker via
  multidocNetAddrHandlers; dispatch-table disjointness with
  multidocHandlers is pinned by a unit test. Integration tests
  verify Phase 1 blocks on bad and passes on valid.

* #189 — Drift preview redacts secret-bearing field values by
  default. secretFieldPaths allowlist with bracket-normalisation
  regex; secret check fires BEFORE bothSlices in
  formatFieldChangeLine so slice-shaped allowlist entries
  (cluster.acceptedCAs, machine.acceptedCAs, peers) never leak
  via formatSliceSetDiff. Length-disclosing redaction sentinel
  (***redacted (len=N)***) preserves rotation signal without
  leaking the value. Non-string secret values route through
  redactValue(fmt.Sprintf("%v", value)) so int / bool rotations
  still surface as different-length sentinels. Map iteration
  order non-determinism is disclaimed in the godoc (no
  map-shaped allowlist entry today).

  New flag --show-secrets-in-drift (default off) plumbed through
  previewDrift / verifyAppliedState / printDriftPreview /
  formatFieldChangeLine. apply.go callers pass
  applyCmdFlags.showSecretsInDrift.

  absentFieldValue const hoisted so formatFieldValue and
  formatSecretFieldValue stay byte-identical on the absent path
  — add/remove vs rotate stays distinguishable.

Doc updates:
  * README.md bullet #1 (Phase 1) now mentions the net-addr
    walker; bullet #2 (Phase 2A) mentions redaction +
    --show-secrets-in-drift + the per-node prefix; bullet #4
    (Phase 2C) says 'the configured reconcile window' instead
    of hardcoding 90s.
  * docs/manual-test-plan.md gains sections C5/C6/C7 (redact,
    flag opt-out, walker), D3 (per-node prefix), E3 (reconcile
    window with help-text + 0s rejection), M6 (false-positive
    guard for redaction), M7 (walker boundary cases). Forward-
    looking 'do X, expect Y' shape with explicit regression
    anchors.
  * docs/apply-safety-gates-test-plan.md gets matching rows in
    Phase 1 (net-addr walker), Phase 2A (redaction +
    per-node prefix), and Phase 2C (reconcile window) tables.

Test coverage:
  * Walker: per-kind table tests with the real schema shape,
    plus TestWalkNetAddrFindings_RealSchema_StaticHostConfig
    and ..._NetworkRuleConfig as schema-anchor regression pins
    so a future drift back to non-existent fields fails these
    tests rather than passing trivially.
  * Reconcile window: validator boundary cases, flag default
    one-(default)-clause pin, fail-fast ordering pin via a
    sentinel originalRunE, hint-text no-hardcoded-90s pin, a
    README-no-hardcoded-90s pin so README and code stay in
    sync.
  * Per-node prefix: maintenance message on both emission
    sites, empty-nodeID bare-line guard, single-vs-multi
    GlobalArgs.Nodes fallback pair.
  * Redaction: exact-match, Wireguard paths, slice-rotation
    no-leak (CA-list and peers shapes), bracket-normalisation,
    false-prefix guard, non-secret control, opt-in flag,
    non-string redaction with length-signal, absent-side
    add-vs-rotate distinction.

Closes #189, closes #190, closes #191, closes #192.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
Aleksei Sviridkin (lexfrei) added a commit that referenced this pull request May 13, 2026
Four operator-UX polish changes on the safety surfaces introduced
by #173.

* #191 — Per-node prefix on the maintenance-connection warning.
  Multi-node insecure apply used to print identical bare
  'talm: drift verification unavailable on maintenance connection'
  lines for every node with no per-node correlation. Both the
  previewDrift and verifyAppliedState emission sites now wrap
  through nodePrefix(nodeID); single-node case (empty nodeID)
  keeps the bare line so the common UX is unchanged.

  cosiPreflightContext got a fallback path to GlobalArgs.Nodes[0]
  when no outgoing-context metadata is attached — the maintenance
  flow (openClientPerNodeMaintenance) pins the node in GlobalArgs
  but does not attach context metadata, so without the fallback
  the per-node prefix collapsed to empty even when --nodes was
  set explicitly. Fallback is single-element-only; multi-element
  GlobalArgs.Nodes returns empty so a broken caller surfaces as
  'no prefix' rather than 'wrong node prefix'.

* #190 — Configurable post-upgrade reconcile window.
  The hardcoded 90s wait after talosctl upgrade returns is now
  --post-upgrade-reconcile-window. Default
  defaultPostUpgradeReconcileWindow = 90 * time.Second preserves
  byte-identical back-compat. validatePostUpgradeReconcileWindow
  rejects non-positive values at the TOP of wrapUpgradeCommand
  RunE — fail-fast BEFORE any talosctl RPC fires, so an
  operator's '=0s' typo cannot land a partial upgrade. The
  version-mismatch hint copy and error body de-hardcode the '90s'
  literal to reference 'the configured reconcile window'
  (defaultPostUpgradeReconcileWindow flows through
  verifyPostUpgradeVersion).

* #192 — Syntactic net-addr walker for three v1alpha1 multidoc
  kinds. Validates fields the actual Talos schema emits:
    * StaticHostConfig.name (IP literal; the `name` field on
      this kind doubles as the IP — there is no separate
      `address` field).
    * NetworkRuleConfig.ingress[].subnet and .except (CIDR
      shapes; no top-level matchSourceAddress[] field exists
      in Talos's schema).
    * WireguardConfig.peers[].endpoint (host:port; empty /
      absent endpoint is a listener-only peer and NOT a
      finding).
  Walker runs in parallel with the Ref-based walker via
  multidocNetAddrHandlers; dispatch-table disjointness with
  multidocHandlers is pinned by a unit test. Integration tests
  verify Phase 1 blocks on bad and passes on valid.

* #189 — Drift preview redacts secret-bearing field values by
  default. secretFieldPaths allowlist with bracket-normalisation
  regex; secret check fires BEFORE bothSlices in
  formatFieldChangeLine so slice-shaped allowlist entries
  (cluster.acceptedCAs, machine.acceptedCAs, peers) never leak
  via formatSliceSetDiff. Length-disclosing redaction sentinel
  (***redacted (len=N)***) preserves rotation signal without
  leaking the value. Non-string secret values route through
  redactValue(fmt.Sprintf("%v", value)) so int / bool rotations
  still surface as different-length sentinels. Map iteration
  order non-determinism is disclaimed in the godoc (no
  map-shaped allowlist entry today).

  New flag --show-secrets-in-drift (default off) plumbed through
  previewDrift / verifyAppliedState / printDriftPreview /
  formatFieldChangeLine. apply.go callers pass
  applyCmdFlags.showSecretsInDrift.

  absentFieldValue const hoisted so formatFieldValue and
  formatSecretFieldValue stay byte-identical on the absent path
  — add/remove vs rotate stays distinguishable.

Doc updates:
  * README.md bullet #1 (Phase 1) now mentions the net-addr
    walker; bullet #2 (Phase 2A) mentions redaction +
    --show-secrets-in-drift + the per-node prefix; bullet #4
    (Phase 2C) says 'the configured reconcile window' instead
    of hardcoding 90s.
  * docs/manual-test-plan.md gains sections C5/C6/C7 (redact,
    flag opt-out, walker), D3 (per-node prefix), E3 (reconcile
    window with help-text + 0s rejection), M6 (false-positive
    guard for redaction), M7 (walker boundary cases). Forward-
    looking 'do X, expect Y' shape with explicit regression
    anchors.
  * docs/apply-safety-gates-test-plan.md gets matching rows in
    Phase 1 (net-addr walker), Phase 2A (redaction +
    per-node prefix), and Phase 2C (reconcile window) tables.

Test coverage:
  * Walker: per-kind table tests with the real schema shape,
    plus TestWalkNetAddrFindings_RealSchema_StaticHostConfig
    and ..._NetworkRuleConfig as schema-anchor regression pins
    so a future drift back to non-existent fields fails these
    tests rather than passing trivially.
  * Reconcile window: validator boundary cases, flag default
    one-(default)-clause pin, fail-fast ordering pin via a
    sentinel originalRunE, hint-text no-hardcoded-90s pin, a
    README-no-hardcoded-90s pin so README and code stay in
    sync.
  * Per-node prefix: maintenance message on both emission
    sites, empty-nodeID bare-line guard, single-vs-multi
    GlobalArgs.Nodes fallback pair.
  * Redaction: exact-match, Wireguard paths, slice-rotation
    no-leak (CA-list and peers shapes), bracket-normalisation,
    false-prefix guard, non-secret control, opt-in flag,
    non-string redaction with length-signal, absent-side
    add-vs-rotate distinction.

Closes #189, closes #190, closes #191, closes #192.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
Aleksei Sviridkin (lexfrei) added a commit that referenced this pull request May 13, 2026
Four operator-UX polish changes on the safety surfaces introduced
by #173.

* #191 — Per-node prefix on the maintenance-connection warning.
  Multi-node insecure apply used to print identical bare
  'talm: drift verification unavailable on maintenance connection'
  lines for every node with no per-node correlation. Both the
  previewDrift and verifyAppliedState emission sites now wrap
  through nodePrefix(nodeID); single-node case (empty nodeID)
  keeps the bare line so the common UX is unchanged.

  cosiPreflightContext got a fallback path to GlobalArgs.Nodes[0]
  when no outgoing-context metadata is attached — the maintenance
  flow (openClientPerNodeMaintenance) pins the node in GlobalArgs
  but does not attach context metadata, so without the fallback
  the per-node prefix collapsed to empty even when --nodes was
  set explicitly. Fallback is single-element-only; multi-element
  GlobalArgs.Nodes returns empty so a broken caller surfaces as
  'no prefix' rather than 'wrong node prefix'.

* #190 — Configurable post-upgrade reconcile window.
  The hardcoded 90s wait after talosctl upgrade returns is now
  --post-upgrade-reconcile-window. Default
  defaultPostUpgradeReconcileWindow = 90 * time.Second preserves
  byte-identical back-compat. validatePostUpgradeReconcileWindow
  rejects non-positive values at the TOP of wrapUpgradeCommand
  RunE — fail-fast BEFORE any talosctl RPC fires, so an
  operator's '=0s' typo cannot land a partial upgrade. The
  version-mismatch hint copy and error body de-hardcode the '90s'
  literal to reference 'the configured reconcile window'
  (defaultPostUpgradeReconcileWindow flows through
  verifyPostUpgradeVersion).

* #192 — Syntactic net-addr walker for three v1alpha1 multidoc
  kinds. Validates fields the actual Talos schema emits:
    * StaticHostConfig.name (IP literal; the `name` field on
      this kind doubles as the IP — there is no separate
      `address` field).
    * NetworkRuleConfig.ingress[].subnet and .except (CIDR
      shapes; no top-level matchSourceAddress[] field exists
      in Talos's schema).
    * WireguardConfig.peers[].endpoint (host:port; empty /
      absent endpoint is a listener-only peer and NOT a
      finding).
  Walker runs in parallel with the Ref-based walker via
  multidocNetAddrHandlers; dispatch-table disjointness with
  multidocHandlers is pinned by a unit test. Integration tests
  verify Phase 1 blocks on bad and passes on valid.

* #189 — Drift preview redacts secret-bearing field values by
  default. secretFieldPaths allowlist with bracket-normalisation
  regex; secret check fires BEFORE bothSlices in
  formatFieldChangeLine so slice-shaped allowlist entries
  (cluster.acceptedCAs, machine.acceptedCAs, peers) never leak
  via formatSliceSetDiff. Length-disclosing redaction sentinel
  (***redacted (len=N)***) preserves rotation signal without
  leaking the value. Non-string secret values route through
  redactValue(fmt.Sprintf("%v", value)) so int / bool rotations
  still surface as different-length sentinels. Map iteration
  order non-determinism is disclaimed in the godoc (no
  map-shaped allowlist entry today).

  New flag --show-secrets-in-drift (default off) plumbed through
  previewDrift / verifyAppliedState / printDriftPreview /
  formatFieldChangeLine. apply.go callers pass
  applyCmdFlags.showSecretsInDrift.

  absentFieldValue const hoisted so formatFieldValue and
  formatSecretFieldValue stay byte-identical on the absent path
  — add/remove vs rotate stays distinguishable.

Doc updates:
  * README.md bullet #1 (Phase 1) now mentions the net-addr
    walker; bullet #2 (Phase 2A) mentions redaction +
    --show-secrets-in-drift + the per-node prefix; bullet #4
    (Phase 2C) says 'the configured reconcile window' instead
    of hardcoding 90s.
  * docs/manual-test-plan.md gains sections C5/C6/C7 (redact,
    flag opt-out, walker), D3 (per-node prefix), E3 (reconcile
    window with help-text + 0s rejection), M6 (false-positive
    guard for redaction), M7 (walker boundary cases). Forward-
    looking 'do X, expect Y' shape with explicit regression
    anchors.
  * docs/apply-safety-gates-test-plan.md gets matching rows in
    Phase 1 (net-addr walker), Phase 2A (redaction +
    per-node prefix), and Phase 2C (reconcile window) tables.

Test coverage:
  * Walker: per-kind table tests with the real schema shape,
    plus TestWalkNetAddrFindings_RealSchema_StaticHostConfig
    and ..._NetworkRuleConfig as schema-anchor regression pins
    so a future drift back to non-existent fields fails these
    tests rather than passing trivially.
  * Reconcile window: validator boundary cases, flag default
    one-(default)-clause pin, fail-fast ordering pin via a
    sentinel originalRunE, hint-text no-hardcoded-90s pin, a
    README-no-hardcoded-90s pin so README and code stay in
    sync.
  * Per-node prefix: maintenance message on both emission
    sites, empty-nodeID bare-line guard, single-vs-multi
    GlobalArgs.Nodes fallback pair.
  * Redaction: exact-match, Wireguard paths, slice-rotation
    no-leak (CA-list and peers shapes), bracket-normalisation,
    false-prefix guard, non-secret control, opt-in flag,
    non-string redaction with length-signal, absent-side
    add-vs-rotate distinction.

Closes #189, closes #190, closes #191, closes #192.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

2 participants