
RL stack refactoring #3075

Open
s1lent4gnt wants to merge 58 commits into main from user/khalil-meftah/2026-02-16-rl-stack-refactor

Conversation

@s1lent4gnt (Member) commented Mar 3, 2026

refactor(rl): Introduce a modular RL stack to support future RL algorithms and VLA fine-tuning with RL

Type / Scope

  • Type: Refactor
  • Scope: src/lerobot/rl/, src/lerobot/configs/, src/lerobot/processor/

Summary / Motivation

HIL-SERL is working (#504, PR #644), but its training logic was coupled to the learner script. This refactor introduces RLAlgorithm, RLTrainer, and DataMixer so that adding a new algorithm means one file plus a config, following the same pattern as policies. No CLI changes; the only breaking change is the policy registry key rename listed below. Phase 1 of a broader plan.

Related issues

  • Roadmap / context: #3076 — RL Stack Refactoring call for contributions

What changed

  • src/lerobot/rl/algorithms/base.py (NEW): RLAlgorithm ABC, RLAlgorithmConfig draccus registry, TrainingStats dataclass (a minimal interface sketch follows this list).
  • src/lerobot/rl/algorithms/sac.py (NEW): SAC logic moved from learner; batch iterator pattern.
  • src/lerobot/rl/algorithms/__init__.py (NEW): Registry, make_algorithm() factory.
  • src/lerobot/rl/trainer.py (NEW): RLTrainer — preprocesses batches, delegates to algorithm.update().
  • src/lerobot/rl/data_sources/data_mixer.py (NEW): DataMixer ABC, OnlineOfflineMixer.
  • src/lerobot/rl/data_sources/__init__.py (NEW): Exports.
  • src/lerobot/rl/learner.py (MODIFIED): Delegates to algorithm.update().
  • src/lerobot/rl/actor.py (MODIFIED): Uses RLAlgorithm.select_action(), load_weights().
  • src/lerobot/configs/train.py (MODIFIED): RLAlgorithmConfig in TrainRLServerPipelineConfig.
  • src/lerobot/processor/normalize_processor.py (MODIFIED): Pre/post processors for observation normalization; added _reshape_visual_stats().
  • tests/rl/test_sac_algorithm.py (NEW): SAC algorithm tests.
  • tests/rl/test_trainer.py (NEW): RLTrainer tests.
  • tests/rl/test_data_mixer.py (NEW): DataMixer tests.
  • tests/rl/test_actor_learner.py (MODIFIED): Integration tests for new abstractions.
  • Rename policies/sac/ → policies/gaussian_actor/ (new): The policy-side directory was renamed for clarity now that RL concerns live in rl/algorithms/sac/. The policy module only owns the actor (a tanh-squashed Gaussian) plus its observation encoder.
    • Classes: SACConfig → GaussianActorConfig, SACPolicy → GaussianActorPolicy, SACObservationEncoder → GaussianActorObservationEncoder.
    • Factory helper: make_sac_pre_post_processors → make_gaussian_actor_pre_post_processors.
    • Registry: @PreTrainedConfig.register_subclass("sac") → @PreTrainedConfig.register_subclass("gaussian_actor"), and name = "gaussian_actor" on the policy class.
    • The SAC algorithm registry key is unchanged (@RLAlgorithmConfig.register_subclass("sac"), SACAlgorithm.name = "sac") — only the policy-side registry moved.
    • Updated all call sites: policies/__init__.py, policies/factory.py, rl/algorithms/sac/{configuration_sac,sac_algorithm}.py, processor/hil_processor.py, docs/source/hilserl.mdx, templates/lerobot_modelcard_template.md, examples/tutorial/rl/hilserl_example.py, and all tests/.
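
To make the new contract concrete, here is a minimal sketch of the algorithm-side interface referenced in the base.py item above. The names RLAlgorithm, TrainingStats, update(), select_action(), and load_weights() come from this PR; the exact signatures and dataclass fields shown here are illustrative assumptions, not the actual implementation.

# Hypothetical sketch of the RLAlgorithm contract described above (not the actual code).
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any, Iterator

import torch

@dataclass
class TrainingStats:
    # Illustrative fields only; the real dataclass may differ.
    losses: dict[str, float] = field(default_factory=dict)
    grad_steps: int = 0

class RLAlgorithm(ABC):
    name: str  # registry key, e.g. "sac"

    @abstractmethod
    def update(self, batch_iterator: Iterator[dict[str, torch.Tensor]]) -> TrainingStats:
        """Consume preprocessed batches and run the algorithm's own gradient-step loop."""

    @abstractmethod
    def select_action(self, observation: dict[str, torch.Tensor]) -> torch.Tensor:
        """Used by the actor process to pick actions during rollouts."""

    @abstractmethod
    def load_weights(self, state_dict: dict[str, Any]) -> None:
        """Used by the actor to refresh weights pushed by the learner."""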

Bug fixes included

This PR also supersedes several open RL bug fix PRs:

  • Keyboard EE teleop (#2345, fixes #2346): Special keys (arrows, shift, ctrl) were silently dropped; intervention state was cleared before get_teleop_events() could read it.
  • Reward classifier (#2355, #3148): Removed leftover normalize_inputs/normalize_targets calls; added **kwargs to Classifier.__init__() for compatibility with make_policy.
  • Corrupted frames on Ctrl+C (#2651): Added try/finally in control_loop to flush the image writer on any exit.
  • Actor/learner cleanup (#3063): Wrapped main calls in try/except/finally to ensure queue close and process join on exceptions.
  • Redundant device move (#3124): Removed unnecessary CPU→GPU→CPU transition move before replay_buffer.add().
  • ReplayBuffer race condition (#2309): Added threading.Lock to add()/sample() to prevent corrupted transitions from async prefetch.
  • Intervention flag overwrite (#3273, fixes #3227): Swapped info merge order and removed hardcoded {IS_INTERVENTION: False} from RobotEnv so teleop intervention is correctly detected.
  • Offline replay buffer (new): Included is_intervention in complementary_info sent to the learner so intervention transitions populate the offline buffer.
  • Visual stats broadcasting (new): Added _reshape_visual_stats() to NormalizerProcessorStep to reshape image stats from (C,) to (C, 1, 1) when provided manually via JSON config (e.g. dataset_stats in RL training configs). Without this, normalizing (B, C, H, W) image tensors with flat (C,) stats fails due to incorrect broadcasting. Stats from compute_stats (imitation learning) are already (C, 1, 1) and are unaffected.
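
A toy illustration of the broadcasting issue (not the actual NormalizerProcessorStep code): flat (C,) stats cannot be subtracted from (B, C, H, W) images, while (C, 1, 1) stats broadcast correctly.

import torch

images = torch.rand(8, 3, 128, 128)           # (B, C, H, W)
flat_mean = torch.tensor([0.5, 0.5, 0.5])     # (C,) as provided manually in a JSON config
# images - flat_mean  -> RuntimeError: (C,) aligns with the last dim (W), not with channels
reshaped_mean = flat_mean.reshape(-1, 1, 1)   # (C, 1, 1), what _reshape_visual_stats produces
normalized = images - reshaped_mean           # broadcasts over H and W as intended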

HIL-SERL pipeline fixes (new)

Bugs surfaced while validating end-to-end HIL-SERL training on SO100 and gym-hil:

  • Gripper penalty wiring (356a64d8): discrete_penalty was never propagated from info to complementary_data in GymHILAdapterProcessorStep, so the gripper penalty term never reached the reward signal. Also reworked GripperPenaltyProcessorStep to fire only on real transitions (is_open & cmd_close or is_closed & cmd_open) with configurable thresholds (open_threshold=0.1, closed_threshold=0.9), and bumped the default penalty from -0.01 to -0.02. Plus plumbed raw_joint_positions from env.get_raw_joint_positions() through complementary_data (needed by SO100 for kinematics).
  • SO100 discrete gripper inverted (a64e6f50): Discrete gripper command {0, 1, 2} was mapped without the SO100 sign flip (joint position increases on close), so close↔open commands were swapped. Standardized the mapping to {0=close, 1=stay, 2=open} to match GamepadTeleop.GripperAction (see the sketch after this list).
  • Neutral gripper action (eda47eca): With the new mapping, the teleop "stay" action is 1 (was 0, which is now "close" — i.e. the policy-paused robot was being commanded to keep closing the gripper).
  • Actor / gym_manipulator reset duplication (77d18659): Extracted reset_and_build_transition(env, env_processor, action_processor) so the actor and the gym_manipulator control loop share one reset path and can't drift.
  • Gripper position key and reset guard (d4a568ee): Fixed the raw joint position key lookup from GRIPPER_KEY to GRIPPER_KEY.pos in GripperPenaltyProcessorStep. Also added an early-return guard when action is None (which happens on the first call after reset before any action has been applied), preventing a crash during episode initialization.
  • Pre-step observation alignment (0d60a855): The dataset was recording the post-step observation together with the action and reward, breaking the standard (obs, action, next_reward) RL tuple. Fixed by capturing the observation before calling step_env_and_process_transition, so each recorded frame correctly stores the observation that the agent saw when it selected the action.
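
To make the standardized mapping and the penalty condition concrete, a small sketch. This is illustrative only: the IntEnum, the function shape, and the convention that a normalized gripper position of 0 means fully open and 1 means fully closed are assumptions; the thresholds and penalty match the defaults quoted above.

from enum import IntEnum

class GripperAction(IntEnum):
    # Matches the {0=close, 1=stay, 2=open} convention described above.
    CLOSE = 0
    STAY = 1
    OPEN = 2

def gripper_penalty(gripper_pos: float, command: GripperAction,
                    open_threshold: float = 0.1, closed_threshold: float = 0.9,
                    penalty: float = -0.02) -> float:
    """Fire only on real transitions: open & commanded close, or closed & commanded open."""
    is_open = gripper_pos <= open_threshold
    is_closed = gripper_pos >= closed_threshold
    if (is_open and command == GripperAction.CLOSE) or (is_closed and command == GripperAction.OPEN):
        return penalty
    return 0.0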

External dependency updates

  • helper2424/resnet10 encoder (used by SAC/HIL-SERL for vision encoding): Updated to work with Transformers v5 and Hugging Face Hub v1. Added self.post_init() for proper v5 weight initialization and model registration. Pinned jax>=0.4.30,<0.5.0 to fix pickle deserialization of legacy SERL weights (named_shape removed in JAX 0.5+). Added requests as explicit dependency (no longer transitive since huggingface-hub v1 switched to httpx). See helper2424/resnet10#7 and helper2424/resnet10#8.

Breaking changes

  • Policy registry key: "sac" → "gaussian_actor". Existing JSON configs must update "policy.type": "sac" to "policy.type": "gaussian_actor", and existing Hub checkpoints with "type": "sac" in config.json must migrate.
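
A possible one-off migration for an existing checkpoint, assuming the key lives under "type" in config.json as described above (sketch only; the path is hypothetical and the layout should be checked against your own configs):

import json
from pathlib import Path

config_path = Path("path/to/checkpoint/config.json")  # hypothetical path, adjust to your checkpoint
config = json.loads(config_path.read_text())
if config.get("type") == "sac":
    config["type"] = "gaussian_actor"
    config_path.write_text(json.dumps(config, indent=2))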

How was this tested

  • Unit tests: pytest -q tests/rl/
  • Full suite: pytest
  • Tested with: gym-hil pick cube (PandaPickCubeGamepad-v0) — learns a successful pick-cube policy in ~15 minutes of training.
  • Tested with: SO100 real robot HIL-SERL training (actor + learner) — pick-and-place cube task achieves a high success rate in approximately one hour of training with human-in-the-loop intervention.
  • ReplayBuffer race condition verified with a stress test (concurrent add/sample threads): 0 corrupted samples across 80,000 checks after the fix (14-22 corrupted per run before the fix); the test pattern is sketched after this list.
  • Post-rename: pytest tests/rl/ tests/processor/ tests/policies/ passes (432 processor, 68 RL, all gaussian_actor/classifier tests green; the 3 pre-existing failures in test_sac_algorithm_target_entropy* and test_gaussian_actor_policy_save_and_load_with_discrete_critic are unrelated to the rename).
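
For reference, the shape of such a stress test, sketched against a toy locked buffer rather than the actual ReplayBuffer API (names and numbers here are illustrative, not the test that was run):

import threading
import torch

class LockedToyBuffer:
    """Stand-in with the same add()/sample() locking pattern described above (not the real ReplayBuffer)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._data: list[torch.Tensor] = []

    def add(self, transition: torch.Tensor) -> None:
        with self._lock:
            self._data.append(transition.clone())

    def sample(self, batch_size: int) -> torch.Tensor:
        with self._lock:
            idx = torch.randint(len(self._data), (batch_size,))
            return torch.stack([self._data[i] for i in idx])

buffer = LockedToyBuffer()
buffer.add(torch.zeros(4))  # seed so sample() never sees an empty buffer
corrupted = 0

def writer():
    for i in range(10_000):
        buffer.add(torch.full((4,), float(i)))  # each transition is internally consistent

def reader():
    global corrupted
    for _ in range(2_000):
        batch = buffer.sample(8)
        # every sampled row must still be constant; a mixed row would indicate a torn read/write
        corrupted += int((batch.min(dim=1).values != batch.max(dim=1).values).sum())

threads = [threading.Thread(target=writer), threading.Thread(target=reader)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("corrupted samples:", corrupted)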

How to run locally (reviewer)

# Run RL tests
pytest -q tests/rl/

# RL training (two terminals, requires gym-hil)
python -m lerobot.rl.learner --config_path json/train_gym_hil.json
python -m lerobot.rl.actor --config_path json/train_gym_hil.json

Checklist (required before merge)

  • Linting/formatting run (pre-commit run -a)
  • All tests pass locally (pytest)
  • Documentation updated
  • CI is green

Reviewer notes

  • The core abstraction to review is RLAlgorithm in algorithms/base.py — this is the contract that every future RL algorithm will implement. Feedback on the interface design is especially welcome.
  • The update(batch_iterator) pattern is intentional: algorithms own the gradient-step loop (including the UTD ratio), while the trainer owns data mixing and preprocessing (see the sketch after these notes).
  • SAC is the only algorithm for now. The abstractions are validated by the fact that SAC works through them.
  • The ReplayBuffer lock holds during tensor reads/writes only; image augmentation runs outside the lock to minimize contention.
  • safe_stop_image_writer now catches BaseException — this also benefits lerobot_record.py and hil_data_collection.py which use the same decorator.
  • The move_transition_to_device removal is safe because ReplayBuffer.add() uses .copy_() which handles cross-device transfers, and ReplayBuffer.sample() already moves batches to the training device.
  • _reshape_visual_stats() only reshapes ndim == 1 tensors for VISUAL features — stats already shaped as (C, 1, 1) (from compute_stats) are unaffected, so this doesn't change imitation learning behavior.
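
A sketch of how that division of responsibilities could look in code. RLTrainer, DataMixer, and update() come from this PR; everything else (function names, the SketchAlgorithm class, the preprocess callable) is illustrative:

import itertools

def make_batch_iterator(mixer, preprocess, batch_size):
    # Trainer side: data mixing and preprocessing only.
    while True:
        yield preprocess(mixer.sample(batch_size))

class SketchAlgorithm:
    """Illustrates how an algorithm owns the gradient-step loop (including the UTD ratio)."""

    def __init__(self, utd_ratio: int = 2):
        self.utd_ratio = utd_ratio
        self.grad_steps = 0

    def update(self, batch_iterator):
        # Pull exactly as many preprocessed batches as the UTD ratio requires;
        # the trainer never needs to know how many gradient steps happen here.
        for batch in itertools.islice(batch_iterator, self.utd_ratio):
            self.grad_steps += 1  # placeholder for critic/actor gradient steps on `batch`
        return {"grad_steps": self.grad_steps}

# usage: stats = SketchAlgorithm().update(make_batch_iterator(mixer, preprocess, batch_size=256))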

github-actions bot added the tests, configuration, and processor labels Mar 3, 2026
s1lent4gnt self-assigned this Mar 3, 2026
s1lent4gnt added the rl label Mar 3, 2026
github-actions bot added the policies and examples labels Mar 11, 2026
@jackvial (Contributor) commented:

@s1lent4gnt

  • The new structure makes sense and I like the naming "algorithms" for training-related methods; it aligns with the research and stays generic.
  • When I first looked at the roadmap it was unclear to me how RECAP would fit into this structure, mainly because pi*0.6/RECAP describes rollouts with interventions on chunked VLA policies. Now that I'm a bit more familiar with RECAP, I can see that the value-function training and advantage-conditioned training code would live under algorithms. But the inference would be separate from the RL actor and learner, since RECAP rollouts don't interact with an RL environment? RECAP rollouts are regular VLA inference with data recording, a binary success/fail label, and optionally per-step intervention labels.
    • So I think it would be useful to include details about RECAP rollouts, intervention, and reward labeling in the roadmap.
  • Clearer instructions and documentation for testing the actor and learner would really help with onboarding both users and potential contributors who are new to the RL module:
    • Provide pretrained weights for the Franka lift-cube task
    • Provide an example config for testing
    • Describe what to expect during training, the controls, and a bit about intervention

@s1lent4gnt (Member Author) commented:

Hey @jackvial thanks for taking the time to review this and for the thoughtful feedback!

You're right on the value function: RECAPAlgorithm will own it.

On RECAP rollouts, great point. RECAP rollouts are indeed separate from the RL actor/learner loop — they are standard VLA inference sessions with data recording, binary success/fail labels, and optional per-step intervention labels. No RL environment involved. I'll add a dedicated section to the roadmap covering: the rollout collection script, how intervention labels are recorded, and the reward labeling step (binary success → per-step rewards, MC returns). That should clarify the full pipeline from data collection to training.

One thing I'm still working through is where the RECAP training loop itself should live: in lerobot-train, in the learner, or in a dedicated script. I haven't settled on the right answer yet and would welcome your thoughts if you have a preference.

I fully agree on the docs. I'll add an example config for the Franka lift-cube task, provide pretrained weights, and document what to expect during training. That should lower the barrier significantly for new users and contributors.

imstevenpmwork mentioned this pull request Apr 3, 2026
s1lent4gnt force-pushed the user/khalil-meftah/2026-02-16-rl-stack-refactor branch 2 times, most recently from 840ab2b to 640ff68 on April 8, 2026 16:04
s1lent4gnt force-pushed the user/khalil-meftah/2026-02-16-rl-stack-refactor branch from fe961a8 to e022207 on April 13, 2026 09:40
s1lent4gnt force-pushed the user/khalil-meftah/2026-02-16-rl-stack-refactor branch from e0e19dc to 9422dc9 on April 13, 2026 11:31
Additional review threads (collapsed) on: src/lerobot/policies/gaussian_actor/configuration_gaussian_actor.py, src/lerobot/policies/gaussian_actor/modeling_gaussian_actor.py, src/lerobot/processor/normalize_processor.py, src/lerobot/rl/algorithms/sac/configuration_sac.py, src/lerobot/rl/algorithms/sac/sac_algorithm.py, src/lerobot/rl/algorithms/base.py, src/lerobot/rl/algorithms/configs.py, src/lerobot/rl/algorithms/factory.py, src/lerobot/rl/data_sources/data_mixer.py (most marked Outdated).

Expanded review thread (diff context):
from .trainer import RLTrainer as RLTrainer

__all__: list[str] = []
__all__ = [
Collaborator:

Interesting that we don't add any require_package guard in here. The hilserl extra in pyproject includes:
hilserl = ["lerobot[transformers-dep]", "gym-hil>=0.1.13,<0.2.0", "lerobot[grpcio-dep]", "lerobot[placo-dep]"]

  1. Evaluate if this is still the case; maybe we can remove some.
  2. For those that we need, check whether to add the guard in the specific file, or in the __init__ if the entire module doesn't make sense without that dependency.

You can check how we do this in the dataset or policy module.

Member Author:

Good catch. I dropped the __init__.py guard because lerobot.rl now also exports algorithm-side primitives (RLAlgorithm, RLAlgorithmConfig, RLTrainer, ReplayBuffer, DataMixer) that don't need gRPC. I've added file-level require_package guards: actor.py / learner.py / learner_service.py for grpcio, and gym_manipulator.py.

Fixed it in 0a470b0
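
For readers unfamiliar with the pattern, a file-level guard looks roughly like this (the extra name "hilserl" is an assumption here; the require_package helper and its import path appear in the diff context further down):

# Sketch of a file-level guard at the top of e.g. actor.py (extra name assumed).
from lerobot.utils.import_utils import require_package

require_package("grpcio", extra="hilserl")

import grpc  # noqa: E402  (safe to import only after the guard has run)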

Collaborator:

The current implementation is fine, but ideally we would like to have as few exceptions in the pyproject.toml as possible. I wonder, wouldn't it be possible to add the guard in the constructors of the classes that require it? As it currently stands, it is equivalent to having the guard in the __init__.py of the rl module, because the symbols are being exported. What we do in policies that require specific dependencies is something like:

# top of the file, no lint exception needed
...
if TYPE_CHECKING or _transformers_available:
    from transformers import CLIPTextModel, CLIPVisionModel
else:
    CLIPTextModel = None
    CLIPVisionModel = None
...

# Then we add the guard at runtime when the class is initialized
class MultiTaskDiTPolicy(PreTrainedPolicy):
    config_class = MultiTaskDiTConfig
    name = "multi_task_dit"

    def __init__(self, config: MultiTaskDiTConfig, **kwargs):
        require_package("transformers", extra="multi_task_dit")
        require_package("diffusers", extra="multi_task_dit")

I understand if this is not possible in this case, but from a quick look I think it would be.

Additional review threads (collapsed) on: src/lerobot/rl/actor.py, src/lerobot/rl/crop_dataset_roi.py, src/lerobot/teleoperators/keyboard/teleop_keyboard.py, src/lerobot/rl/trainer.py, src/lerobot/rl/algorithms/sac/configuration_sac.py, src/lerobot/rl/train_rl.py (several marked Outdated).

Expanded review thread (diff context):
# Check if this is a GymHIL simulation environment
if cfg.name == "gym_hil":
    assert cfg.robot is None and cfg.teleop is None, "GymHIL environment does not support robot or teleop"
    from lerobot.utils.import_utils import require_package
Collaborator:

prioritize imports at the top level


Labels

configuration, documentation, examples, policies, processor, rl, robots, tests


Development

Successfully merging this pull request may close these issues:

  • gym_manipulator: Teleop intervention flag gets overwritten to False in step_env_and_process_transition
  • HIL-SERL - Keyboard teleop is not working

6 participants