fix(rl): ensure queue and process cleanup on abnormal exit #3063

Merged
pkooij merged 2 commits into huggingface:main from jashshah999:fix/actor-cli-cleanup on Apr 13, 2026

Conversation

@jashshah999
Contributor

What this does

Wraps the main execution in actor_cli and start_learner_threads with try/finally so that queues are closed and processes are joined even when an unhandled exception occurs.

Previously, if act_with_policy or add_actor_information_and_train raised an exception, all cleanup code was skipped, leaking queues and child processes.

Changes

  • actor.py: Wrap act_with_policy() call in try/finally, log exception, set shutdown_event
  • learner.py: Wrap add_actor_information_and_train() call in try/finally, log exception, set shutdown_event

Fixes #3059
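
For readers who want the shape of the change without opening the diff, here is a minimal sketch of the try/finally pattern described above. It is not the code from actor.py or learner.py: `run_with_cleanup`, its parameters, and `join_timeout` are hypothetical names, and re-raising after logging is an assumption of this sketch. The queue and process arguments only need `close()` and `join(timeout)`, which `multiprocessing.Queue` and `multiprocessing.Process` both provide.

```python
import logging


def run_with_cleanup(main_fn, shutdown_event, queues, processes, join_timeout=5.0):
    """Run main_fn, then always signal shutdown, close queues, and join processes.

    Hypothetical helper for illustration only; the names are not taken from
    the PR's source files.
    """
    try:
        main_fn()
    except Exception:
        # Log with the traceback so an abnormal exit leaves a record.
        logging.exception("Main loop failed; cleaning up before exit")
        raise  # assumption: let the caller still see the failure
    finally:
        # Runs on normal return and on exception alike.
        shutdown_event.set()  # ask child processes to stop
        for q in queues:
            q.close()
        for p in processes:
            p.join(timeout=join_timeout)
```

Because the cleanup sits in `finally`, it runs whether `main_fn` returns or raises, which is the property the description relies on.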

Wrap the main execution in actor_cli and start_learner_threads with
try/finally so that queues are closed and processes are joined even
when an unhandled exception occurs. Previously, exceptions in
act_with_policy or add_actor_information_and_train would skip all
cleanup code, leaking GPU/CPU resources.

Also sets the shutdown_event on exception so child processes exit
gracefully.

Fixes huggingface#3059
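
Setting shutdown_event only helps if the children check it. The sketch below shows one way a child-side loop might poll the event between queue reads so that it returns soon after the parent sets it. `worker_loop`, `inbox`, `handle`, and `poll_interval` are hypothetical names, and the thread in the demo stands in for a child process to keep the example short; this is an assumption about how such a loop could look, not the loop used in the repository.

```python
import queue
import threading


def worker_loop(shutdown_event, inbox, handle, poll_interval=0.1):
    """Handle items from inbox until shutdown_event is set.

    Returns the number of items handled. Hypothetical illustration only.
    """
    handled = 0
    while not shutdown_event.is_set():
        try:
            # A bounded wait keeps the loop responsive to the shutdown flag.
            item = inbox.get(timeout=poll_interval)
        except queue.Empty:
            continue  # nothing queued; re-check the flag
        handle(item)
        handled += 1
    return handled


if __name__ == "__main__":
    stop = threading.Event()
    inbox = queue.Queue()
    worker = threading.Thread(target=worker_loop, args=(stop, inbox, print))
    worker.start()
    inbox.put("transition")
    stop.set()  # what the parent does when its main loop fails
    worker.join(timeout=2.0)
```

The bounded `get(timeout=...)` is the design choice that matters here: a blocking `get()` with no timeout would never notice the event on an empty queue.
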
@s1lent4gnt self-assigned this on Mar 2, 2026
Member

@s1lent4gnt left a comment

LGTM!
@jashshah999 Thank you for your contributions!

@pkooij merged commit 9bd844a into huggingface:main on Apr 13, 2026
4 checks passed
@s1lent4gnt mentioned this pull request on Apr 15, 2026
4 tasks

Development

Successfully merging this pull request may close these issues.

The actor_cli function is the core entry point of the Actor process, but the current code lacks log fallback when exiting abnormally

3 participants