Skip to content

fix(bootstrap): surface Helm install failure on namespace timeout (#211)#486

Open
Manoj-engineer wants to merge 1 commit intoNVIDIA:mainfrom
Manoj-engineer:fix/helm-error-diagnosis-211
Open

fix(bootstrap): surface Helm install failure on namespace timeout (#211)#486
Manoj-engineer wants to merge 1 commit intoNVIDIA:mainfrom
Manoj-engineer:fix/helm-error-diagnosis-211

Conversation

@Manoj-engineer
Copy link

Vouched at: #420

Summary

when gateway start times out waiting for the openshell namespace, the error
message now checks for failed helm-install-* jobs in kube-system and surfaces
the actual Helm error and last 30 log lines instead of the generic "namespace not ready" message.

Related Issue

Fixes #211

Changes

  • Add diagnose_helm_failure() in openshell-bootstrap/src/lib.rs that queries
    helm-install-* jobs in kube-system for failed pods and returns job conditions
    • last 30 log lines
  • Wire into wait_for_namespace() final timeout branch
  • Fix awk filter: status.failed stays <none> during backoff retry window;
    filter on != "0" instead of != "<none>" && != "0" to catch actively-failing jobs
  • Add unit test helm_failure_hint_is_included_in_namespace_timeout_message

Testing

  • mise run pre-commit passes
  • Unit tests added/updated — 78/78 pass
  • E2E tests added/updated (not applicable — error path only)
  • Live-tested end-to-end: built a gateway image with corrupted serviceaccount.yaml,
    confirmed the Helm error appears in the terminal output on timeout

Checklist

  • Follows Conventional Commits
  • Commits are signed off (DCO)
  • Architecture docs updated (not applicable)

@Manoj-engineer Manoj-engineer requested a review from a team as a code owner March 19, 2026 21:42
@drew drew self-assigned this Mar 20, 2026
@drew drew added the test:e2e Requires end-to-end coverage label Mar 21, 2026
Comment on lines +972 to +986
let (job_output, job_exit) = exec_capture_with_exit(
docker,
container_name,
vec![
"sh".to_string(),
"-c".to_string(),
format!(
"KUBECONFIG={kubeconfig} kubectl get jobs -n kube-system \
--no-headers -o custom-columns=NAME:.metadata.name,FAILED:.status.failed \
2>/dev/null | awk '{{if ($2 != \"0\") print $1}}'"
),
],
)
.await
.ok()?;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please make sure this won't return false positives. If the namespace timeout happens for any other reason, this predicate will still select completed k3s Helm jobs because .status.failed is usually absent for successful Jobs.

Comment on lines -51 to 55
GatewayMetadata, clear_active_gateway, extract_host_from_ssh_destination, get_gateway_metadata,
list_gateways, load_active_gateway, load_gateway_metadata, load_last_sandbox,
remove_gateway_metadata, resolve_ssh_hostname, save_active_gateway, save_last_sandbox,
store_gateway_metadata,
GatewayMetadata, clear_active_gateway, extract_host_from_ssh_destination,
get_gateway_metadata, list_gateways, load_active_gateway, load_gateway_metadata,
load_last_sandbox, remove_gateway_metadata, resolve_ssh_hostname, save_active_gateway,
save_last_sandbox, store_gateway_metadata,
};
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review failing CI checks

[rust:format:check] $ cargo fmt --all -- --check
info: syncing channel updates for stable-x86_64-unknown-linux-gnu
info: latest update on 2026-03-05 for version 1.94.0 (4a4ef493e 2026-03-02)
info: downloading 6 components
Diff in /__w/OpenShell/OpenShell/crates/openshell-bootstrap/src/lib.rs:48:
     DockerPreflight, ExistingGatewayInfo, check_docker_available, create_ssh_docker_client,
 };
 pub use crate::metadata::{
-    GatewayMetadata, clear_active_gateway, extract_host_from_ssh_destination,
-    get_gateway_metadata, list_gateways, load_active_gateway, load_gateway_metadata,
-    load_last_sandbox, remove_gateway_metadata, resolve_ssh_hostname, save_active_gateway,
-    save_last_sandbox, store_gateway_metadata,
+    GatewayMetadata, clear_active_gateway, extract_host_from_ssh_destination, get_gateway_metadata,
+    list_gateways, load_active_gateway, load_gateway_metadata, load_last_sandbox,
+    remove_gateway_metadata, resolve_ssh_hostname, save_active_gateway, save_last_sandbox,
+    store_gateway_metadata,
 };
 
 /// Options for remote SSH deployment.
[rust:format:check] ERROR task failed

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch on both. Fixed:

False positives — changed the awk filter from $2 != "0" to $2 ~ /^[1-9]/ so it only matches numeric non-zero values. (absent status.failed on successful jobs) no longer matches.

Formatting — the rebase picked up upstream's import line-wrapping change which conflicted with our format; ran cargo fmt --all to resolve it.

90/90 tests still pass.

…IDIA#211)

Signed-off-by: Manoj-engineer <194872717+Manoj-engineer@users.noreply.github.com>
@Manoj-engineer Manoj-engineer force-pushed the fix/helm-error-diagnosis-211 branch from 718d44e to 7541eb5 Compare March 21, 2026 22:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

test:e2e Requires end-to-end coverage

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Improve error message when Helm chart has malformed YAML

2 participants