
RL stack refactoring #3075

Open
s1lent4gnt wants to merge 58 commits into main from user/khalil-meftah/2026-02-16-rl-stack-refactor

Conversation

@s1lent4gnt (Member) commented Mar 3, 2026

refactor(rl): Introduce a modular RL stack to support future RL algorithms and VLA fine-tuning with RL

Type / Scope

  • Type: Refactor
  • Scope: src/lerobot/rl/, src/lerobot/configs/, src/lerobot/processor/

Summary / Motivation

HIL-SERL is working (#504, PR #644), but its training logic was coupled to the learner script. This refactor introduces RLAlgorithm, RLTrainer, and DataMixer so that adding a new algorithm means one file plus a config, following the same pattern as policies. No CLI changes; the only breaking change is the policy registry key rename listed below. Phase 1 of a broader plan.

Related issues

  • Roadmap / context: #3076 — RL Stack Refactoring call for contributions

What changed

  • src/lerobot/rl/algorithms/base.py (NEW): RLAlgorithm ABC, RLAlgorithmConfig draccus registry, TrainingStats dataclass (a minimal interface sketch follows this list).
  • src/lerobot/rl/algorithms/sac.py (NEW): SAC logic moved from learner; batch iterator pattern.
  • src/lerobot/rl/algorithms/__init__.py (NEW): Registry, make_algorithm() factory.
  • src/lerobot/rl/trainer.py (NEW): RLTrainer — preprocesses batches, delegates to algorithm.update().
  • src/lerobot/rl/data_sources/data_mixer.py (NEW): DataMixer ABC, OnlineOfflineMixer.
  • src/lerobot/rl/data_sources/__init__.py (NEW): Exports.
  • src/lerobot/rl/learner.py (MODIFIED): Delegates to algorithm.update().
  • src/lerobot/rl/actor.py (MODIFIED): Uses RLAlgorithm.select_action(), load_weights().
  • src/lerobot/configs/train.py (MODIFIED): RLAlgorithmConfig in TrainRLServerPipelineConfig.
  • src/lerobot/processor/normalize_processor.py (MODIFIED): Pre/post processors for observation normalization; added _reshape_visual_stats().
  • tests/rl/test_sac_algorithm.py (NEW): SAC algorithm tests.
  • tests/rl/test_trainer.py (NEW): RLTrainer tests.
  • tests/rl/test_data_mixer.py (NEW): DataMixer tests.
  • tests/rl/test_actor_learner.py (MODIFIED): Integration tests for new abstractions.
  • Rename policies/sac/ → policies/gaussian_actor/ (new): The policy-side directory was renamed for clarity now that RL concerns live in rl/algorithms/sac/. The policy module only owns the actor (a tanh-squashed Gaussian) plus its observation encoder.
    • Classes: SACConfig → GaussianActorConfig, SACPolicy → GaussianActorPolicy, SACObservationEncoder → GaussianActorObservationEncoder.
    • Factory helper: make_sac_pre_post_processors → make_gaussian_actor_pre_post_processors.
    • Registry: @PreTrainedConfig.register_subclass("sac") → @PreTrainedConfig.register_subclass("gaussian_actor"), and name = "gaussian_actor" on the policy class.
    • The SAC algorithm registry key is unchanged (@RLAlgorithmConfig.register_subclass("sac"), SACAlgorithm.name = "sac") — only the policy-side registry moved.
    • Updated all call sites: policies/__init__.py, policies/factory.py, rl/algorithms/sac/{configuration_sac,sac_algorithm}.py, processor/hil_processor.py, docs/source/hilserl.mdx, templates/lerobot_modelcard_template.md, examples/tutorial/rl/hilserl_example.py, and all tests/.
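
To make the new contract concrete, here is a minimal sketch of the algorithm-side interface referenced in the base.py item above. The names RLAlgorithm, TrainingStats, update(), select_action(), and load_weights() come from this PR; the exact signatures and dataclass fields shown here are illustrative assumptions, not the actual implementation.

# Hypothetical sketch of the RLAlgorithm contract described above (not the actual code).
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any, Iterator

import torch

@dataclass
class TrainingStats:
    # Illustrative fields only; the real dataclass may differ.
    losses: dict[str, float] = field(default_factory=dict)
    grad_steps: int = 0

class RLAlgorithm(ABC):
    name: str  # registry key, e.g. "sac"

    @abstractmethod
    def update(self, batch_iterator: Iterator[dict[str, torch.Tensor]]) -> TrainingStats:
        """Consume preprocessed batches and run the algorithm's own gradient-step loop."""

    @abstractmethod
    def select_action(self, observation: dict[str, torch.Tensor]) -> torch.Tensor:
        """Used by the actor process to pick actions during rollouts."""

    @abstractmethod
    def load_weights(self, state_dict: dict[str, Any]) -> None:
        """Used by the actor to refresh weights pushed by the learner."""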

Bug fixes included

This PR also supersedes several open RL bug fix PRs:

  • Keyboard EE teleop (#2345, fixes #2346): Special keys (arrows, shift, ctrl) were silently dropped; intervention state was cleared before get_teleop_events() could read it.
  • Reward classifier (#2355, #3148): Removed leftover normalize_inputs/normalize_targets calls; added **kwargs to Classifier.__init__() for compatibility with make_policy.
  • Corrupted frames on Ctrl+C (#2651): Added try/finally in control_loop to flush the image writer on any exit.
  • Actor/learner cleanup (#3063): Wrapped main calls in try/except/finally to ensure queue close and process join on exceptions.
  • Redundant device move (#3124): Removed unnecessary CPU→GPU→CPU transition move before replay_buffer.add().
  • ReplayBuffer race condition (#2309): Added threading.Lock to add()/sample() to prevent corrupted transitions from async prefetch.
  • Intervention flag overwrite (#3273, fixes #3227): Swapped info merge order and removed hardcoded {IS_INTERVENTION: False} from RobotEnv so teleop intervention is correctly detected.
  • Offline replay buffer (new): Included is_intervention in complementary_info sent to the learner so intervention transitions populate the offline buffer.
  • Visual stats broadcasting (new): Added _reshape_visual_stats() to NormalizerProcessorStep to reshape image stats from (C,) to (C, 1, 1) when provided manually via JSON config (e.g. dataset_stats in RL training configs). Without this, normalizing (B, C, H, W) image tensors with flat (C,) stats fails due to incorrect broadcasting. Stats from compute_stats (imitation learning) are already (C, 1, 1) and are unaffected.
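
A toy illustration of the broadcasting issue (not the actual NormalizerProcessorStep code): flat (C,) stats cannot be subtracted from (B, C, H, W) images, while (C, 1, 1) stats broadcast correctly.

import torch

images = torch.rand(8, 3, 128, 128)           # (B, C, H, W)
flat_mean = torch.tensor([0.5, 0.5, 0.5])     # (C,) as provided manually in a JSON config
# images - flat_mean  -> RuntimeError: (C,) aligns with the last dim (W), not with channels
reshaped_mean = flat_mean.reshape(-1, 1, 1)   # (C, 1, 1), what _reshape_visual_stats produces
normalized = images - reshaped_mean           # broadcasts over H and W as intended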

HIL-SERL pipeline fixes (new)

Bugs surfaced while validating end-to-end HIL-SERL training on SO100 and gym-hil:

  • Gripper penalty wiring (356a64d8): discrete_penalty was never propagated from info to complementary_data in GymHILAdapterProcessorStep, so the gripper penalty term never reached the reward signal. Also reworked GripperPenaltyProcessorStep to fire only on real transitions (is_open & cmd_close or is_closed & cmd_open) with configurable thresholds (open_threshold=0.1, closed_threshold=0.9), and bumped the default penalty from -0.01 to -0.02. Plus plumbed raw_joint_positions from env.get_raw_joint_positions() through complementary_data (needed by SO100 for kinematics).
  • SO100 discrete gripper inverted (a64e6f50): Discrete gripper command {0, 1, 2} was mapped without the SO100 sign flip (joint position increases on close), so close↔open commands were swapped. Standardized the mapping to {0=close, 1=stay, 2=open} to match GamepadTeleop.GripperAction (see the sketch after this list).
  • Neutral gripper action (eda47eca): With the new mapping, the teleop "stay" action is 1 (was 0, which is now "close" — i.e. the policy-paused robot was being commanded to keep closing the gripper).
  • Actor / gym_manipulator reset duplication (77d18659): Extracted reset_and_build_transition(env, env_processor, action_processor) so the actor and the gym_manipulator control loop share one reset path and can't drift.
  • Gripper position key and reset guard (d4a568ee): Fixed the raw joint position key lookup from GRIPPER_KEY to GRIPPER_KEY.pos in GripperPenaltyProcessorStep. Also added an early-return guard when action is None (which happens on the first call after reset before any action has been applied), preventing a crash during episode initialization.
  • Pre-step observation alignment (0d60a855): The dataset was recording the post-step observation together with the action and reward, breaking the standard (obs, action, next_reward) RL tuple. Fixed by capturing the observation before calling step_env_and_process_transition, so each recorded frame correctly stores the observation that the agent saw when it selected the action.
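
To make the standardized mapping and the penalty condition concrete, a small sketch. This is illustrative only: the IntEnum, the function shape, and the convention that a normalized gripper position of 0 means fully open and 1 means fully closed are assumptions; the thresholds and penalty match the defaults quoted above.

from enum import IntEnum

class GripperAction(IntEnum):
    # Matches the {0=close, 1=stay, 2=open} convention described above.
    CLOSE = 0
    STAY = 1
    OPEN = 2

def gripper_penalty(gripper_pos: float, command: GripperAction,
                    open_threshold: float = 0.1, closed_threshold: float = 0.9,
                    penalty: float = -0.02) -> float:
    """Fire only on real transitions: open & commanded close, or closed & commanded open."""
    is_open = gripper_pos <= open_threshold
    is_closed = gripper_pos >= closed_threshold
    if (is_open and command == GripperAction.CLOSE) or (is_closed and command == GripperAction.OPEN):
        return penalty
    return 0.0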

External dependency updates

  • helper2424/resnet10 encoder (used by SAC/HIL-SERL for vision encoding): Updated to work with Transformers v5 and Hugging Face Hub v1. Added self.post_init() for proper v5 weight initialization and model registration. Pinned jax>=0.4.30,<0.5.0 to fix pickle deserialization of legacy SERL weights (named_shape removed in JAX 0.5+). Added requests as explicit dependency (no longer transitive since huggingface-hub v1 switched to httpx). See helper2424/resnet10#7 and helper2424/resnet10#8.

Breaking changes

  • Policy registry key: "sac" → "gaussian_actor". Existing JSON configs must update "policy.type": "sac" to "policy.type": "gaussian_actor", and existing Hub checkpoints with "type": "sac" in config.json must migrate.
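
A possible one-off migration for an existing checkpoint, assuming the key lives under "type" in config.json as described above (sketch only; the path is hypothetical and the layout should be checked against your own configs):

import json
from pathlib import Path

config_path = Path("path/to/checkpoint/config.json")  # hypothetical path, adjust to your checkpoint
config = json.loads(config_path.read_text())
if config.get("type") == "sac":
    config["type"] = "gaussian_actor"
    config_path.write_text(json.dumps(config, indent=2))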

How was this tested

  • Unit tests: pytest -q tests/rl/
  • Full suite: pytest
  • Tested with: gym-hil pick cube (PandaPickCubeGamepad-v0) — learns a successful pick-cube policy in ~15 minutes of training.
  • Tested with: SO100 real robot HIL-SERL training (actor + learner) — pick-and-place cube task achieves a high success rate in approximately one hour of training with human-in-the-loop intervention.
  • ReplayBuffer race condition verified with a stress test (concurrent add/sample threads): 0 corrupted samples across 80,000 checks after the fix (14-22 corrupted per run before the fix); the test pattern is sketched after this list.
  • Post-rename: pytest tests/rl/ tests/processor/ tests/policies/ passes (432 processor, 68 RL, all gaussian_actor/classifier tests green; the 3 pre-existing failures in test_sac_algorithm_target_entropy* and test_gaussian_actor_policy_save_and_load_with_discrete_critic are unrelated to the rename).
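
For reference, the shape of such a stress test, sketched against a toy locked buffer rather than the actual ReplayBuffer API (names and numbers here are illustrative, not the test that was run):

import threading
import torch

class LockedToyBuffer:
    """Stand-in with the same add()/sample() locking pattern described above (not the real ReplayBuffer)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._data: list[torch.Tensor] = []

    def add(self, transition: torch.Tensor) -> None:
        with self._lock:
            self._data.append(transition.clone())

    def sample(self, batch_size: int) -> torch.Tensor:
        with self._lock:
            idx = torch.randint(len(self._data), (batch_size,))
            return torch.stack([self._data[i] for i in idx])

buffer = LockedToyBuffer()
buffer.add(torch.zeros(4))  # seed so sample() never sees an empty buffer
corrupted = 0

def writer():
    for i in range(10_000):
        buffer.add(torch.full((4,), float(i)))  # each transition is internally consistent

def reader():
    global corrupted
    for _ in range(2_000):
        batch = buffer.sample(8)
        # every sampled row must still be constant; a mixed row would indicate a torn read/write
        corrupted += int((batch.min(dim=1).values != batch.max(dim=1).values).sum())

threads = [threading.Thread(target=writer), threading.Thread(target=reader)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("corrupted samples:", corrupted)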

How to run locally (reviewer)

# Run RL tests
pytest -q tests/rl/

# RL training (two terminals, requires gym-hil)
python -m lerobot.rl.learner --config_path json/train_gym_hil.json
python -m lerobot.rl.actor --config_path json/train_gym_hil.json

Checklist (required before merge)

  • Linting/formatting run (pre-commit run -a)
  • All tests pass locally (pytest)
  • Documentation updated
  • CI is green

Reviewer notes

  • The core abstraction to review is RLAlgorithm in algorithms/base.py — this is the contract that every future RL algorithm will implement. Feedback on the interface design is especially welcome.
  • The update(batch_iterator) pattern is intentional: algorithms own the gradient-step loop (including the UTD ratio), while the trainer owns data mixing and preprocessing (see the sketch after these notes).
  • SAC is the only algorithm for now. The abstractions are validated by the fact that SAC works through them.
  • The ReplayBuffer lock holds during tensor reads/writes only; image augmentation runs outside the lock to minimize contention.
  • safe_stop_image_writer now catches BaseException — this also benefits lerobot_record.py and hil_data_collection.py which use the same decorator.
  • The move_transition_to_device removal is safe because ReplayBuffer.add() uses .copy_() which handles cross-device transfers, and ReplayBuffer.sample() already moves batches to the training device.
  • _reshape_visual_stats() only reshapes ndim == 1 tensors for VISUAL features — stats already shaped as (C, 1, 1) (from compute_stats) are unaffected, so this doesn't change imitation learning behavior.
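
A sketch of how that division of responsibilities could look in code. RLTrainer, DataMixer, and update() come from this PR; everything else (function names, the SketchAlgorithm class, the preprocess callable) is illustrative:

import itertools

def make_batch_iterator(mixer, preprocess, batch_size):
    # Trainer side: data mixing and preprocessing only.
    while True:
        yield preprocess(mixer.sample(batch_size))

class SketchAlgorithm:
    """Illustrates how an algorithm owns the gradient-step loop (including the UTD ratio)."""

    def __init__(self, utd_ratio: int = 2):
        self.utd_ratio = utd_ratio
        self.grad_steps = 0

    def update(self, batch_iterator):
        # Pull exactly as many preprocessed batches as the UTD ratio requires;
        # the trainer never needs to know how many gradient steps happen here.
        for batch in itertools.islice(batch_iterator, self.utd_ratio):
            self.grad_steps += 1  # placeholder for critic/actor gradient steps on `batch`
        return {"grad_steps": self.grad_steps}

# usage: stats = SketchAlgorithm().update(make_batch_iterator(mixer, preprocess, batch_size=256))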

github-actions bot added the tests, configuration, and processor labels Mar 3, 2026
s1lent4gnt self-assigned this Mar 3, 2026
s1lent4gnt added the rl label Mar 3, 2026
github-actions bot added the policies and examples labels Mar 11, 2026
@jackvial (Contributor) commented:

@s1lent4gnt

  • The new structure makes sense and I like the naming "algorithms" for training-related methods; it aligns with the research and stays generic.
  • When I first looked at the roadmap it was unclear to me how RECAP would fit into this structure, mainly because pi*0.6/RECAP describes rollouts with interventions on chunked VLA policies. Now that I'm a bit more familiar with RECAP, I can see that the value-function training and advantage-conditioned training code would live under algorithms. But the inference would be separate from the RL actor and learner, since RECAP rollouts don't interact with an RL environment? RECAP rollouts are regular VLA inference with data recording, a binary success/fail label, and optionally per-step intervention labels.
    • So I think it would be useful to include details about RECAP rollouts, intervention, and reward labeling in the roadmap.
  • Clearer instructions and documentation for testing the actor and learner would really help with onboarding both users and potential contributors who are new to the RL module:
    • Provide pretrained weights for the Franka lift-cube task
    • Provide an example config for testing
    • Describe what to expect during training, the controls, and a bit about intervention

@s1lent4gnt (Member Author) commented:

Hey @jackvial thanks for taking the time to review this and for the thoughtful feedback!

You're right on the value function: RECAPAlgorithm will own it.

On RECAP rollouts, great point. RECAP rollouts are indeed separate from the RL actor/learner loop — they are standard VLA inference sessions with data recording, binary success/fail labels, and optional per-step intervention labels. No RL environment involved. I'll add a dedicated section to the roadmap covering: the rollout collection script, how intervention labels are recorded, and the reward labeling step (binary success → per-step rewards, MC returns). That should clarify the full pipeline from data collection to training.

One thing I'm still working through is where the RECAP training loop itself should live: in lerobot-train, in the learner, or in a dedicated script. I haven't settled on the right answer yet and would welcome your thoughts if you have a preference.

I fully agree on the docs. I'll add an example config for the Franka lift-cube task, provide pretrained weights, and document what to expect during training. That should lower the barrier significantly for new users and contributors.

imstevenpmwork mentioned this pull request Apr 3, 2026
s1lent4gnt force-pushed the user/khalil-meftah/2026-02-16-rl-stack-refactor branch 2 times, most recently from 840ab2b to 640ff68 on April 8, 2026 16:04
s1lent4gnt force-pushed the user/khalil-meftah/2026-02-16-rl-stack-refactor branch from fe961a8 to e022207 on April 13, 2026 09:40
s1lent4gnt force-pushed the user/khalil-meftah/2026-02-16-rl-stack-refactor branch from e0e19dc to 9422dc9 on April 13, 2026 11:31
Additional review threads (collapsed) on: src/lerobot/policies/gaussian_actor/configuration_gaussian_actor.py, src/lerobot/policies/gaussian_actor/modeling_gaussian_actor.py, src/lerobot/processor/normalize_processor.py, src/lerobot/rl/algorithms/sac/configuration_sac.py, src/lerobot/rl/algorithms/sac/sac_algorithm.py, src/lerobot/rl/algorithms/base.py, src/lerobot/rl/algorithms/configs.py, src/lerobot/rl/algorithms/factory.py, src/lerobot/rl/data_sources/data_mixer.py (most marked Outdated).

Expanded review thread (diff context):
from .trainer import RLTrainer as RLTrainer

__all__: list[str] = []
__all__ = [
Collaborator:

Interesting that we don't add any require_package guard in here. The hilserl extra in pyproject includes:
hilserl = ["lerobot[transformers-dep]", "gym-hil>=0.1.13,<0.2.0", "lerobot[grpcio-dep]", "lerobot[placo-dep]"]

  1. Evaluate if this is still the case; maybe we can remove some.
  2. For those that we need, check whether to add the guard in the specific file, or in the __init__ if the entire module doesn't make sense without that dependency.

You can check how we do this in the dataset or policy module.

Member Author:

Good catch. I dropped the __init__.py guard because lerobot.rl now also exports algorithm-side primitives (RLAlgorithm, RLAlgorithmConfig, RLTrainer, ReplayBuffer, DataMixer) that don't need gRPC. I've added file-level require_package guards: actor.py / learner.py / learner_service.py for grpcio, and gym_manipulator.py.

Fixed it in 0a470b0
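
For readers unfamiliar with the pattern, a file-level guard looks roughly like this (the extra name "hilserl" is an assumption here; the require_package helper and its import path appear in the diff context further down):

# Sketch of a file-level guard at the top of e.g. actor.py (extra name assumed).
from lerobot.utils.import_utils import require_package

require_package("grpcio", extra="hilserl")

import grpc  # noqa: E402  (safe to import only after the guard has run)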

Collaborator:

The current implementation is fine, but ideally we would like to have as few exceptions in the pyproject.toml as possible. I wonder, wouldn't it be possible to add the guard in the constructors of the classes that require it? As it currently stands, it is equivalent to having the guard in the __init__.py of the rl module, because the symbols are being exported. What we do in policies that require specific dependencies is something like:

# top of the file, no lint exception needed
...
if TYPE_CHECKING or _transformers_available:
    from transformers import CLIPTextModel, CLIPVisionModel
else:
    CLIPTextModel = None
    CLIPVisionModel = None
...

# Then we add the guard at runtime when the class is initialized
class MultiTaskDiTPolicy(PreTrainedPolicy):
    config_class = MultiTaskDiTConfig
    name = "multi_task_dit"

    def __init__(self, config: MultiTaskDiTConfig, **kwargs):
        require_package("transformers", extra="multi_task_dit")
        require_package("diffusers", extra="multi_task_dit")

I understand if this is not possible in this case, but from a quick look I think it would be.

Additional review threads (collapsed) on: src/lerobot/rl/actor.py, src/lerobot/rl/crop_dataset_roi.py, src/lerobot/teleoperators/keyboard/teleop_keyboard.py, src/lerobot/rl/trainer.py, src/lerobot/rl/algorithms/sac/configuration_sac.py, src/lerobot/rl/train_rl.py (several marked Outdated).

Expanded review thread (diff context):
# Check if this is a GymHIL simulation environment
if cfg.name == "gym_hil":
    assert cfg.robot is None and cfg.teleop is None, "GymHIL environment does not support robot or teleop"
    from lerobot.utils.import_utils import require_package
Collaborator:

prioritize imports at the top level


Labels

configuration, documentation, examples, policies, processor, rl, robots, tests


Development

Successfully merging this pull request may close these issues:

  • gym_manipulator: Teleop intervention flag gets overwritten to False in step_env_and_process_transition
  • HIL-SERL - Keyboard teleop is not working

6 participants