
Surface underlying Pod and HelmRelease events in application detail view #14

@tym83


Problem

When a user deploys an application via the dashboard, the application detail view shows a single high-level status (Progressing, Ready, Failed). For long-running installs that get stuck at the Kubernetes layer (a Pod cannot be scheduled, a PVC stays Pending, an image pull fails, an init container crashes), the dashboard gives no signal beyond Progressing. The user has to drop to kubectl describe pod / kubectl get events to find out what is actually wrong.

Reproduction

  1. Pick any external app from the marketplace that requests heavy resources (e.g. a vLLM-style chart with a 16 GiB memory request)
  2. Deploy it with the default spec on a cluster that cannot satisfy the request, e.g. a 3-node cluster with no GPU and modest memory
  3. Watch the application detail view in the dashboard

What the user sees

Status:    Progressing
Age:       28m
Version:   0.1.1+...
Message:   Running 'install' action with timeout of 5m0s

The status remains Progressing indefinitely: Flux's helm-controller cycles through 5-minute install timeouts and retries while the underlying Pod can never be scheduled.

What is actually happening (visible only via kubectl)

$ kubectl describe pod -l app.kubernetes.io/instance=<release>

Events:
  FailedScheduling: 0/3 nodes are available:
    1 node(s) had untolerated taint {drbd.linbit.com/lost-quorum: }
    2 Insufficient memory

Expected behavior

The application detail view should surface:

  1. Pod-level events for resources created by the chart: at minimum the most recent FailedScheduling, FailedMount, BackOff, and Unhealthy events
  2. The HelmRelease's lastAttemptedHelmInstall message, i.e. the Helm logs currently buried in kubectl describe helmrelease
  3. A derived Reason field summarising the most actionable problem (e.g. "Insufficient memory on all nodes", "PVC not bound: no provisioner", "Image pull error: …")

This makes the chain CR → HelmRelease → Resources → Pod legible without leaving the dashboard.
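One way to derive the Reason field is to take the most recent Warning event whose reason is in the minimum set listed above. A minimal sketch in Python, assuming events are already fetched as dicts in the core/v1 Event shape (`type`, `reason`, `message`, `lastTimestamp`/`eventTime`); `derive_reason` and the 120-character limit are hypothetical choices, and mapping raw messages to friendlier text such as "PVC not bound: no provisioner" is left out:

```python
from datetime import datetime, timezone
from typing import Optional

# The minimum set of event reasons this issue asks the dashboard to surface.
ACTIONABLE_REASONS = {"FailedScheduling", "FailedMount", "BackOff", "Unhealthy"}
_EPOCH = datetime.min.replace(tzinfo=timezone.utc)


def _event_time(event: dict) -> datetime:
    # core/v1 Events may leave lastTimestamp empty and set eventTime instead.
    raw = event.get("lastTimestamp") or event.get("eventTime") or event.get("firstTimestamp")
    if not raw:
        return _EPOCH
    return datetime.fromisoformat(raw.replace("Z", "+00:00"))


def derive_reason(events: list[dict], max_len: int = 120) -> Optional[str]:
    """Hypothetical helper: pick the most recent actionable Warning event and
    condense it into a one-line Reason, or return None if there is none."""
    warnings = [
        e for e in events
        if e.get("type") == "Warning" and e.get("reason") in ACTIONABLE_REASONS
    ]
    if not warnings:
        return None
    latest = max(warnings, key=_event_time)
    text = f"{latest['reason']}: {(latest.get('message') or '').strip()}"
    # Keep the status bar readable; the full text would stay in the Events panel.
    return text if len(text) <= max_len else text[: max_len - 3] + "..."
```

Picking by recency rather than by severity is a judgment call; a real implementation might prefer FailedScheduling over BackOff when both are present.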

Proposed solution

UI changes

  • Add an Events panel to the application detail view that lists events for all resources labelled with this release (filtered by app.kubernetes.io/instance or the HelmRelease's chart-managed-by label)
  • Add a Helm output panel showing the tail of status.history[-1] and status.lastAttemptedHelmInstall.message from the corresponding HelmRelease
  • Promote the most recent Warning-level event into the top status bar as a Reason field
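Events do not carry the labels of the object they describe, so filtering by app.kubernetes.io/instance likely needs two steps: list the labelled objects (Pods, PVCs, ...), then match events on `involvedObject.uid`. A minimal sketch in Python, assuming both lists were already fetched as dicts (e.g. from `kubectl get pods,pvc -o json` and `kubectl get events -o json`) and that the chart follows the Helm convention of putting the instance label on its Pod templates, which is not guaranteed; `select_release_events` is hypothetical:

```python
INSTANCE_LABEL = "app.kubernetes.io/instance"


def select_release_events(objects: list[dict], events: list[dict], release: str) -> list[dict]:
    """Hypothetical helper: return the events whose involvedObject is one of
    the objects labelled with the given release."""
    # Collect UIDs of Pods, PVCs, etc. that carry the release's instance label.
    owned_uids = {
        obj["metadata"]["uid"]
        for obj in objects
        if (obj["metadata"].get("labels") or {}).get(INSTANCE_LABEL) == release
    }
    # Match on UID rather than name: StatefulSet Pods (e.g. "vllm-0") keep their
    # name when recreated, so name matching could pull in events from an old Pod.
    return [
        ev for ev in events
        if ev.get("involvedObject", {}).get("uid") in owned_uids
    ]
```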

Controller-side (optional, separate change)

  • The ApplicationDefinition controller could propagate a condensed Reason from the underlying HelmRelease into the CR's status.conditions[].message. The dashboard would then not need to chase HelmRelease → Pod every time; the CR itself would carry enough information for the high-level view.
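The condensing step could look roughly like the following. It is written in Python for illustration only (an actual controller would more likely be Go), assumes the HelmRelease exposes a standard `Ready` condition with `status`, `reason` and `message`, and takes the Pod-level reason as an already-derived string; `condense_ready_condition` and the 256-character cap are hypothetical:

```python
from typing import Optional


def condense_ready_condition(hr_status: dict, pod_reason: Optional[str],
                             max_len: int = 256) -> dict:
    """Hypothetical helper: build the application CR's Ready condition from
    the HelmRelease's Ready condition plus an optional Pod-level reason."""
    ready = next(
        (c for c in hr_status.get("conditions", []) if c.get("type") == "Ready"),
        {},
    )
    status = ready.get("status", "Unknown")
    message = ready.get("message", "")
    # Only prepend the Pod-level reason while the release is not Ready;
    # once the install succeeds, stale scheduling warnings should not linger.
    if pod_reason and status != "True":
        message = f"{pod_reason} (HelmRelease: {message})" if message else pod_reason
    return {
        "type": "Ready",
        "status": status,
        "reason": ready.get("reason", "Unknown"),
        "message": message[:max_len],
    }
```

Keeping the HelmRelease message in parentheses preserves the existing "Running 'install' action ..." text while putting the actionable cause first.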

Workarounds today

Operators currently rely on:

kubectl describe pod -n <tenant-ns> -l app.kubernetes.io/instance=<release>
kubectl get events -n <tenant-ns> --sort-by=.lastTimestamp | tail -20
kubectl describe helmrelease -n <tenant-ns> <release>

This is acceptable for operators familiar with Kubernetes internals but defeats the purpose of having a dashboard for application lifecycle management.

Context

Observed during the development of an external-apps catalog (vLLM, ComfyUI, JupyterHub, Langflow, n8n, Open WebUI, HolmesGPT) registered as ApplicationDefinitions in Cozystack 1.4. Several of these apps have hard resource requirements (GPU, memory) that cannot be satisfied on smaller smoke clusters, and the dashboard's silent Progressing state actively misleads operators into waiting hours before they discover the install can never succeed.
