Pre-requisites
What happened? What did you expect to happen?
Summary
When a CronWorkflow's run() completes, persistUpdate() applies a MergePatch that includes the full metadata.labels and metadata.annotations maps from its in-memory snapshot. If any external actor (e.g., a deployment tool, another controller, or a manual kubectl patch) modifies the CronWorkflow's labels between the time the informer cache is read and persistUpdate() executes, those changes are silently overwritten.
Root Cause
1. Stale snapshot from informer cache
When a cron schedule fires, the controller reads the CronWorkflow from the informer's local cache, not from the API server directly:
// controller.go:157
obj, _, _ := cc.cronWfInformer.Informer().GetIndexer().GetByKey(key)
This object is then copied into a Go struct and held in memory for the entire duration of run():
// controller.go:173-174
cronWf := &v1alpha1.CronWorkflow{}
util.FromUnstructuredObj(un, cronWf)
// controller.go:182
cronWorkflowOperationCtx := newCronWfOperationCtx(ctx, cronWf, ...)
From this point on, woc.cronWf.Labels is a frozen snapshot. It is never refreshed.
2. persistUpdate() patches metadata unnecessarily
// operator.go:175-176
func (woc *cronWfOperationCtx) persistUpdate(ctx context.Context) {
woc.patch(ctx, map[string]any{
"status": woc.cronWf.Status,
"metadata": map[string]any{
"annotations": woc.cronWf.Annotations,
"labels": woc.cronWf.Labels, // stale snapshot
},
})
}
This method is called via defer at the end of every run() invocation (operator.go:95). The patch type is types.MergePatchType, which means the stale label map replaces whatever is currently in etcd; it does not merge individual keys.
3. The only labels that actually need persisting
Reviewing the code, only two cases require writing metadata back:
| Case |
Field |
When |
| Schedule annotation |
metadata.annotations["workflows.argoproj.io/last-used-schedule"] |
When IsUsingNewSchedule() is true |
| Completion label |
metadata.labels["workflows.argoproj.io/cron-workflow-completed"] |
When setAsCompleted() is called |
General-purpose labels (e.g., app, deploy-version, team) do not need to be persisted by the cron controller, yet they are included in every patch.
Race Window
Time -------------------------------------------------------------->
Controller: ① Read from informer cache (labels=v1, snapshot frozen)
|
② Submit child Workflow
|
③ Update in-memory status fields
|
④ persistUpdate() -> MergePatch labels=v1 to etcd
|
+--- race window: ① to ④ -----------------------------+
External: kubectl patch labels=v2 (written to etcd)
|
+--- overwritten by ④ -> labels revert to v1
The race window spans from the informer cache read (step ①) to the persistUpdate() call (step ④). Any external label modification during this window is silently lost.
In practice, the window is short (tens to hundreds of milliseconds), but:
- CronWorkflows that run every minute create 1,440 race opportunities per day
- API server load can stretch the
SubmitWorkflow call, widening the window
- The informer cache itself has propagation delay from etcd, extending the window further before step ①
Impact
- Label-based pruning tools (e.g., deploy version selectors) may incorrectly identify a CronWorkflow as belonging to an old version and delete it
- GitOps controllers that rely on labels for ownership tracking may enter reconciliation loops
- Any external system that patches CronWorkflow labels can have its changes silently reverted
Steps to Reproduce
- Create a CronWorkflow with
metadata.labels.deploy-version: v1 and a * * * * * schedule
- Wait for a cron trigger
- While
run() is executing (to widen the window, add a time.Sleep(30s) before the closing brace of run()), patch the label:
kubectl patch cronworkflow <name> --type merge \
-p '{"metadata":{"labels":{"deploy-version":"v2"}}}'
- After
persistUpdate() completes, observe that deploy-version has reverted to v1
Environment
- Argo Workflows version: latest
main branch
- Relevant source:
workflow/cron/operator.go:170-176, workflow/cron/controller.go:157-182
Version(s)
v3.7.10, latest
Paste a minimal workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.
apiVersion: argoproj.io/v1alpha1
kind: CronWorkflow
metadata:
name: metadata-race-test
labels:
test-label: "initial"
spec:
schedule: "* * * * *"
concurrencyPolicy: "Forbid"
workflowSpec:
entrypoint: main
templates:
- name: main
container:
image: busybox
command: ["sh", "-c"]
args: ["sleep 10"]
Logs from the workflow controller
Logs from in your workflow's wait container
Pre-requisites
:latestimage tag (i.e.quay.io/argoproj/workflow-controller:latest) and can confirm the issue still exists on:latest. If not, I have explained why, in detail, in my description below.What happened? What did you expect to happen?
Summary
When a CronWorkflow's
run()completes,persistUpdate()applies aMergePatchthat includes the fullmetadata.labelsandmetadata.annotationsmaps from its in-memory snapshot. If any external actor (e.g., a deployment tool, another controller, or a manualkubectl patch) modifies the CronWorkflow's labels between the time the informer cache is read andpersistUpdate()executes, those changes are silently overwritten.Root Cause
1. Stale snapshot from informer cache
When a cron schedule fires, the controller reads the CronWorkflow from the informer's local cache, not from the API server directly:
This object is then copied into a Go struct and held in memory for the entire duration of
run():From this point on,
woc.cronWf.Labelsis a frozen snapshot. It is never refreshed.2.
persistUpdate()patches metadata unnecessarilyThis method is called via
deferat the end of everyrun()invocation (operator.go:95). The patch type istypes.MergePatchType, which means the stale label map replaces whatever is currently in etcd; it does not merge individual keys.3. The only labels that actually need persisting
Reviewing the code, only two cases require writing metadata back:
metadata.annotations["workflows.argoproj.io/last-used-schedule"]IsUsingNewSchedule()is truemetadata.labels["workflows.argoproj.io/cron-workflow-completed"]setAsCompleted()is calledGeneral-purpose labels (e.g.,
app,deploy-version,team) do not need to be persisted by the cron controller, yet they are included in every patch.Race Window
The race window spans from the informer cache read (step ①) to the
persistUpdate()call (step ④). Any external label modification during this window is silently lost.In practice, the window is short (tens to hundreds of milliseconds), but:
SubmitWorkflowcall, widening the windowImpact
Steps to Reproduce
metadata.labels.deploy-version: v1and a* * * * *schedulerun()is executing (to widen the window, add atime.Sleep(30s)before the closing brace ofrun()), patch the label:persistUpdate()completes, observe thatdeploy-versionhas reverted tov1Environment
mainbranchworkflow/cron/operator.go:170-176,workflow/cron/controller.go:157-182Version(s)
v3.7.10, latest
Paste a minimal workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.
Logs from the workflow controller
Logs from in your workflow's wait container