Skip to content

fix(rl): add exception handling and cleanup guarantee to actor_cli#3238

Closed
wadeKeith wants to merge 1 commit intohuggingface:mainfrom
wadeKeith:fix/actor-cli-exception-handling-clean
Closed

fix(rl): add exception handling and cleanup guarantee to actor_cli#3238
wadeKeith wants to merge 1 commit intohuggingface:mainfrom
wadeKeith:fix/actor-cli-exception-handling-clean

Conversation

@wadeKeith
Copy link
Copy Markdown
Contributor

Problem

Fixes #3059.

The actor_cli function is the core entry point for the Actor process. When act_with_policy() throws an uncaught exception, all subsequent cleanup logic is skipped:

  1. Resource leak: transitions_queue.close(), interactions_queue.close(), parameters_queue.close() never execute, leaving GPU/CPU resources consumed
  2. No diagnostics: The crash reason is not logged, making production debugging extremely difficult
  3. Missing exit log: No way to confirm whether the Actor process exited normally or crashed

Fix

Wrapped the act_with_policy() call and cleanup logic in try/except/finally:

  • try: Runs act_with_policy() and logs normal completion
  • except: Logs the exception with full traceback via logging.exception(), sets shutdown_event to signal child processes, then re-raises
  • finally: Always executes queue close + process join + queue cancel_join_thread, guaranteeing cleanup regardless of how act_with_policy() exits

Changes

  • src/lerobot/rl/actor.py: Restructured actor_cli cleanup into try/except/finally block

Fixes huggingface#3059. When act_with_policy throws an uncaught exception, queue
closing and process joining logic was skipped, causing resource leaks
(GPU/CPU). Wrapped the main body in try/except/finally to ensure:

1. Exceptions are logged with full traceback for debugging
2. shutdown_event is set on crash to signal child processes
3. Queue close + process join always runs in the finally block
4. Clear exit logging for both normal and abnormal termination
Copilot AI review requested due to automatic review settings March 29, 2026 09:33
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Improves robustness/observability of the RL Actor entrypoint (actor_cli) by ensuring exceptions from act_with_policy() are logged and cleanup is attempted via a try/except/finally structure.

Changes:

  • Wraps act_with_policy() in try/except/finally to log crashes with traceback and run cleanup consistently.
  • Signals shutdown (shutdown_event.set()) on exception to help dependent worker loops terminate.
  • Adds additional lifecycle logs around normal completion and cleanup.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/lerobot/rl/actor.py
Comment on lines +193 to +207
transitions_queue.close()
interactions_queue.close()
parameters_queue.close()

transitions_process.join()
logging.info("[ACTOR] Transitions process joined")
interactions_process.join()
logging.info("[ACTOR] Interactions process joined")
receive_policy_process.join()
logging.info("[ACTOR] Receive policy process joined")

logging.info("[ACTOR] Cancelling queue join threads")
transitions_queue.cancel_join_thread()
interactions_queue.cancel_join_thread()
parameters_queue.cancel_join_thread()
Copy link

Copilot AI Mar 29, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The finally cleanup sequence (queue.close(), process.join(), cancel_join_thread()) can itself raise (e.g., if a queue is already closed or a process is in a bad state). If that happens, it can mask the original exception from act_with_policy(). Consider wrapping individual cleanup steps with try/except (logging failures) or contextlib.suppress(...) so diagnostics from the original crash are preserved.

Suggested change
transitions_queue.close()
interactions_queue.close()
parameters_queue.close()
transitions_process.join()
logging.info("[ACTOR] Transitions process joined")
interactions_process.join()
logging.info("[ACTOR] Interactions process joined")
receive_policy_process.join()
logging.info("[ACTOR] Receive policy process joined")
logging.info("[ACTOR] Cancelling queue join threads")
transitions_queue.cancel_join_thread()
interactions_queue.cancel_join_thread()
parameters_queue.cancel_join_thread()
try:
transitions_queue.close()
except Exception:
logging.exception("[ACTOR] Failed to close transitions_queue during cleanup")
try:
interactions_queue.close()
except Exception:
logging.exception("[ACTOR] Failed to close interactions_queue during cleanup")
try:
parameters_queue.close()
except Exception:
logging.exception("[ACTOR] Failed to close parameters_queue during cleanup")
try:
transitions_process.join()
logging.info("[ACTOR] Transitions process joined")
except Exception:
logging.exception("[ACTOR] Failed to join transitions process during cleanup")
try:
interactions_process.join()
logging.info("[ACTOR] Interactions process joined")
except Exception:
logging.exception("[ACTOR] Failed to join interactions process during cleanup")
try:
receive_policy_process.join()
logging.info("[ACTOR] Receive policy process joined")
except Exception:
logging.exception("[ACTOR] Failed to join receive policy process during cleanup")
logging.info("[ACTOR] Cancelling queue join threads")
try:
transitions_queue.cancel_join_thread()
except Exception:
logging.exception("[ACTOR] Failed to cancel join thread for transitions_queue during cleanup")
try:
interactions_queue.cancel_join_thread()
except Exception:
logging.exception("[ACTOR] Failed to cancel join thread for interactions_queue during cleanup")
try:
parameters_queue.cancel_join_thread()
except Exception:
logging.exception("[ACTOR] Failed to cancel join thread for parameters_queue during cleanup")

Copilot uses AI. Check for mistakes.
Comment thread src/lerobot/rl/actor.py
Comment on lines +209 to 212
logging.info("[ACTOR] Actor process exited")

logging.info("[ACTOR] queues closed")

Copy link

Copilot AI Mar 29, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

logging.info("[ACTOR] queues closed") is outside the try/except/finally, so it will never run on the exception path because the exception is re-raised. If you want this log to be emitted regardless of success/failure, move it into the finally block (or adjust the message to reflect that it only logs on normal exit).

Copilot uses AI. Check for mistakes.
Comment thread src/lerobot/rl/actor.py
Comment on lines +178 to +202
try:
act_with_policy(
cfg=cfg,
shutdown_event=shutdown_event,
parameters_queue=parameters_queue,
transitions_queue=transitions_queue,
interactions_queue=interactions_queue,
)
logging.info("[ACTOR] act_with_policy completed normally")
except Exception:
logging.exception("[ACTOR] act_with_policy crashed with an unhandled exception")
shutdown_event.set()
raise
finally:
logging.info("[ACTOR] Closing queues")
transitions_queue.close()
interactions_queue.close()
parameters_queue.close()

transitions_process.join()
logging.info("[ACTOR] Transitions process joined")
interactions_process.join()
logging.info("[ACTOR] Interactions process joined")
receive_policy_process.join()
logging.info("[ACTOR] Receive policy process joined")
Copy link

Copilot AI Mar 29, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After act_with_policy() returns normally, shutdown_event is not set, but send_transitions() / send_interactions() loops run while not shutdown_event.is_set(). In that case the subsequent join() calls can hang, or the children may crash when the queues are closed. Consider setting shutdown_event on the normal-completion path (or unconditionally at the start of finally) before joining so the streaming loops exit cleanly.

Copilot uses AI. Check for mistakes.
@s1lent4gnt s1lent4gnt self-assigned this Mar 29, 2026
@wadeKeith
Copy link
Copy Markdown
Contributor Author

Friendly ping — just checking in on this PR to see if you have any feedback. Happy to address any questions! 🙏

@s1lent4gnt
Copy link
Copy Markdown
Member

Hi @wadeKeith, thank you for making this PR!
I will close it because it was addressed in #3063.

@s1lent4gnt s1lent4gnt closed this Apr 13, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

The actor_cli function is the core entry point of the Actor process, but the current code lacks log fallback when exiting abnormally

3 participants