CronWorkflow controller may overwrite newer metadata.labels/annotations with stale state during persistUpdate() #15877

@GeunSam2

Description

Pre-requisites

  • I have double-checked my configuration
  • I have tested with the :latest image tag (i.e. quay.io/argoproj/workflow-controller:latest) and can confirm the issue still exists on :latest. If not, I have explained why, in detail, in my description below.
  • I have searched existing issues and could not find a match for this bug
  • I'd like to contribute the fix myself (see contributing guide)

What happened? What did you expect to happen?

Summary

When a CronWorkflow's run() completes, persistUpdate() applies a MergePatch that includes the full metadata.labels and metadata.annotations maps from its in-memory snapshot. If any external actor (e.g., a deployment tool, another controller, or a manual kubectl patch) modifies the CronWorkflow's labels between the time the informer cache is read and persistUpdate() executes, those changes are silently overwritten.

Root Cause

1. Stale snapshot from informer cache

When a cron schedule fires, the controller reads the CronWorkflow from the informer's local cache, not from the API server directly:

// controller.go:157
obj, _, _ := cc.cronWfInformer.Informer().GetIndexer().GetByKey(key)

This object is then copied into a Go struct and held in memory for the entire duration of run():

// controller.go:173-174
cronWf := &v1alpha1.CronWorkflow{}
util.FromUnstructuredObj(un, cronWf)

// controller.go:182
cronWorkflowOperationCtx := newCronWfOperationCtx(ctx, cronWf, ...)

From this point on, woc.cronWf.Labels is a frozen snapshot. It is never refreshed.

2. persistUpdate() patches metadata unnecessarily

// operator.go:175-176
func (woc *cronWfOperationCtx) persistUpdate(ctx context.Context) {
    woc.patch(ctx, map[string]any{
        "status": woc.cronWf.Status,
        "metadata": map[string]any{
            "annotations": woc.cronWf.Annotations,
            "labels":      woc.cronWf.Labels,  // stale snapshot
        },
    })
}

This method is called via defer at the end of every run() invocation (operator.go:95). The patch type is types.MergePatchType (RFC 7386 JSON Merge Patch), which merges objects key by key. Because the patch carries the full stale label map, every key it contains overwrites the live value in etcd, reverting any external change to those keys.

3. The only labels that actually need persisting

Reviewing the code, only two cases require writing metadata back:

| Case | Field | When |
|------|-------|------|
| Schedule annotation | metadata.annotations["workflows.argoproj.io/last-used-schedule"] | IsUsingNewSchedule() is true |
| Completion label | metadata.labels["workflows.argoproj.io/cron-workflow-completed"] | setAsCompleted() is called |

General-purpose labels (e.g., app, deploy-version, team) do not need to be persisted by the cron controller, yet they are included in every patch.

Race Window

Time -------------------------------------------------------------->

Controller:  ① Read from informer cache (labels=v1, snapshot frozen)
             |
             ② Submit child Workflow
             |
             ③ Update in-memory status fields
             |
             ④ persistUpdate() -> MergePatch labels=v1 to etcd
             |
             +--- race window: ① to ④ -----------------------------+

External:         kubectl patch labels=v2 (written to etcd)
                  |
                  +--- overwritten by ④ -> labels revert to v1

The race window spans from the informer cache read (step ①) to the persistUpdate() call (step ④). Any external label modification during this window is silently lost.

In practice, the window is short (tens to hundreds of milliseconds), but:

  • CronWorkflows that run every minute create 1,440 race opportunities per day
  • API server load can stretch the SubmitWorkflow call, widening the window
  • The informer cache itself lags etcd, so the snapshot can already be stale before step ①, widening the window further

Impact

  • Label-based pruning tools (e.g., deploy version selectors) may incorrectly identify a CronWorkflow as belonging to an old version and delete it
  • GitOps controllers that rely on labels for ownership tracking may enter reconciliation loops
  • Any external system that patches CronWorkflow labels can have its changes silently reverted

Steps to Reproduce

  1. Create a CronWorkflow with metadata.labels.deploy-version: v1 and a * * * * * schedule
  2. Wait for a cron trigger
  3. While run() is executing (to widen the window, add a time.Sleep(30 * time.Second) before the closing brace of run()), patch the label:
kubectl patch cronworkflow <name> --type merge \
  -p '{"metadata":{"labels":{"deploy-version":"v2"}}}'
  4. After persistUpdate() completes, observe that deploy-version has reverted to v1

Environment

  • Argo Workflows version: latest main branch
  • Relevant source: workflow/cron/operator.go:170-176, workflow/cron/controller.go:157-182

Version(s)

v3.7.10, latest

Paste a minimal workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

apiVersion: argoproj.io/v1alpha1
kind: CronWorkflow
metadata:
  name: metadata-race-test
  labels:
    test-label: "initial"
spec:
  schedule: "* * * * *"
  concurrencyPolicy: "Forbid"
  workflowSpec:
    entrypoint: main
    templates:
      - name: main
        container:
          image: busybox
          command: ["sh", "-c"]
          args: ["sleep 10"]

Logs from the workflow controller

-

Logs from in your workflow's wait container

-
