Skip to content

controller: support namespace-scoped cache via WATCH_NAMESPACE#766

Open
zjh7890 wants to merge 2 commits intoalibaba:mainfrom
zjh7890:feat/namespace-scoped-cache
Open

controller: support namespace-scoped cache via WATCH_NAMESPACE#766
zjh7890 wants to merge 2 commits intoalibaba:mainfrom
zjh7890:feat/namespace-scoped-cache

Conversation

@zjh7890
Copy link
Copy Markdown

@zjh7890 zjh7890 commented Apr 21, 2026

Motivation

Both BatchSandboxReconciler and PoolReconciler declare Owns(&corev1.Pod{}) in SetupWithManager. In cmd/controller/main.go, the manager is created without Cache.DefaultNamespaces, which means the underlying controller-runtime shared informer for Pods has cluster-wide scope.

This is controller-runtime's documented default and isn't wrong on its own. However, on a large shared cluster that hosts many unrelated workloads alongside OpenSandbox:

  • The shared Pod informer performs a full list+watch of every Pod in the cluster on startup.
  • The Pod.ownerRefUID field index (internal/utils/fieldindex) is built over that entire set.
  • Cold-start memory usage grows with cluster-wide Pod count, not with the number of OpenSandbox-owned Pods, which can significantly increase startup memory pressure on multi-tenant clusters.

Observed symptom on a shared multi-tenant cluster where only one namespace contained OpenSandbox CRs:

  • opensandbox-controller-manager crash-looped with exitCode=137 / reason=OOMKilled within ~5s of startup. Logs only got as far as register field index / starting manager / starting server name=health probe before the kernel killed the process on cgroup memory limit.
  • Raising the Pod memory limit helped but is a band-aid — headroom still has to cover unrelated cluster growth.

Why this is safe to narrow

The current reconcile code already treats BatchSandbox/Pool/Pod as a same-namespace graph, so restricting the cache to the namespaces that actually contain OpenSandbox CRs does not remove any existing capability:

So the cluster-wide cache is paying a cost that none of the reconcile paths actually consume.

Change

Add an opt-in flag to narrow the manager cache to a specific set of namespaces.

  • New flag --watch-namespaces (comma-separated list).
  • Defaults to the WATCH_NAMESPACE environment variable (common operator convention).
  • When non-empty, wires into cache.Options.DefaultNamespaces, producing a multi-namespace cache.
  • When empty (or unset), behavior is unchanged — the manager stays cluster-scoped, preserving backward compatibility.

Parser is fail-closed, not fail-open

The flag's purpose is least-privilege / smaller cache, so malformed input must not silently widen scope back to cluster-wide. buildWatchNamespaces rejects any malformed input with a clear error and the controller exits non-zero at startup:

Input Behavior
"" (empty) cluster-wide (explicit opt-out)
"ns-a" / "ns-a,ns-b" namespace-scoped
" ns-a , ns-b " trimmed, namespace-scoped
"ns-a,ns-a" de-duplicated to {ns-a}
",", " , ", "ns-a,", ",ns-a", "ns-a,,ns-b" error, exit 1 (empty segment)
"Default", "bad_ns", 64-char name error, exit 1 (fails DNS-1123 label validation via k8s.io/apimachinery/pkg/util/validation.IsDNS1123Label)

The startup log also sorts the namespace list before printing so the line is deterministic across restarts.

Example:

# Deployment snippet
env:
  - name: WATCH_NAMESPACE
    value: "opensandbox"

or:

sandbox-controller --watch-namespaces=opensandbox,another-ns

Log line on startup:

restricting manager cache to specific namespaces  namespaces=[opensandbox]

vs. unset:

manager cache is cluster-scoped (watching all namespaces)

Compatibility & risk

  • Flag defaults to empty → no behavior change for existing users.
  • No CRD / RBAC / Helm chart changes in this PR. Operators who opt in can tighten RBAC from ClusterRole to Role/RoleBinding in a follow-up; that is not required for the cache change itself to take effect.
  • Owns(&corev1.Pod{}) is unchanged. Pool / BatchSandbox reconcile logic is unchanged.
  • cache.Options.DefaultNamespaces is compatible with the existing Metrics / WebhookServer manager options and with fieldindex.RegisterFieldIndexesmultiNamespaceCache supports IndexField.
  • Note for future contributors: if webhook handlers are later added that need to read across namespaces via the manager cache, those webhooks would need to be scoped consistently with the configured watch-namespaces.
  • Cost model caveat: DefaultNamespaces creates one cache/informer per namespace, so this flag is intended for "a small number of namespaces replaces one cluster-wide cache", not for arbitrarily long lists.

Testing

  • go build ./cmd/controller/... passes.

  • go vet ./cmd/controller/... passes.

  • New table-driven unit test cmd/controller/main_test.go covers 15 cases: empty, whitespace-only, single, multi, trim, duplicates, mid-empty / leading-empty / trailing-empty / single-comma / whitespace-between-commas (all rejected), uppercase (rejected), underscore (rejected), 64-char overlong (rejected), 63-char max-length (accepted). All pass under go test -count=1 -v ./cmd/controller/....

  • --help shows the new flag:

    -watch-namespaces string
        Comma-separated list of namespaces the manager should watch.
        Empty (default) watches all namespaces; providing a list restricts
        the shared informer cache to only those namespaces. Also reads the
        WATCH_NAMESPACE environment variable.
    

Follow-ups (not in this PR, happy to do separately if accepted)

  • Document the WATCH_NAMESPACE env / --watch-namespaces flag in config/manager/manager.yaml and the Helm chart so the feature is discoverable.
  • Optional: make RBAC generation respect the watched-namespace set (emit Role/RoleBinding instead of ClusterRole when a scope is requested).

Happy to iterate on naming (--watch-namespaces vs --namespaces vs adopting the prometheus-operator style multi-flag split) or on defaults if maintainers prefer a different convention.

Both BatchSandboxReconciler and PoolReconciler declare Owns(&corev1.Pod{}).
Because cmd/controller/main.go does not configure Cache.DefaultNamespaces,
controller-runtime builds a shared Pod informer whose list/watch scope is
the full cluster. On clusters with many unrelated workloads, this causes
the manager to hold the entire cluster's Pod list in its cache even when
only one namespace contains OpenSandbox CRs, which can lead to OOMKills
during cold-start list+decode+index.

Add a new --watch-namespaces flag (comma-separated) that defaults to the
WATCH_NAMESPACE environment variable. When set, the value is wired into
cache.Options.DefaultNamespaces so the shared informer cache only serves
the listed namespaces. When unset, behavior is unchanged (cluster-scoped
cache, watches all namespaces) for backward compatibility.

- Pool and BatchSandbox CRDs are already namespaced.
- Reconcile-time list/get calls already restrict to the CR's own namespace.
- BatchSandbox.Spec.PoolRef has no namespace field, so cross-namespace
  allocation is not an existing feature and is not regressed.
@CLAassistant
Copy link
Copy Markdown

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.


zjh7890 seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

Address review feedback on the initial commit:

- buildWatchNamespaces now returns (map[string]cache.Config, error).
  - Empty / whitespace-only input -> (nil, nil) preserves cluster-wide default.
  - Non-empty input that yields no valid tokens returns an error instead of
    silently falling back to cluster-wide. This matters because the flag's
    purpose is to narrow scope; silent fallback would broaden privileges.
  - Empty segments inside a non-empty list (",", "ns-a,,ns-b", "ns-a,")
    are now rejected rather than skipped, so operator/templating mistakes
    surface instead of being masked.
  - Each token is validated with validation.IsDNS1123Label so typos like
    "Bad_NS" or overlong names fail fast at startup.
- Duplicates in a valid list continue to be collapsed (set semantics).
- main() exits non-zero with a clear setupLog.Error if parsing fails.
- Startup log sorts the watched namespace list so log output is stable
  across restarts (Go map iteration is intentionally unordered).
- Add table-driven test cmd/controller/main_test.go covering empty,
  whitespace, single, multiple, duplicate, mid-empty, leading/trailing
  empty, DNS-1123 violations (case, underscore), max-length (63) and
  overlong (64) cases.
flag.BoolVar(&logCompress, "log-compress", true, "Compress determines if the rotated log files should be compressed using gzip")
flag.Float64Var(&kubeClientQPS, "kube-client-qps", 100, "QPS for Kubernetes client rate limiter.")
flag.IntVar(&kubeClientBurst, "kube-client-burst", 200, "Burst for Kubernetes client rate limiter.")
flag.StringVar(&watchNamespaces, "watch-namespaces", os.Getenv("WATCH_NAMESPACE"),
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The watch-namespace flag name is weird, How about namespace, moreover, In my opinion, there is no need to support value from env.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants