RL stack refactoring #3075
Conversation
Hey @jackvial thanks for taking the time to review this and for the thoughtful feedback! You're right on the value function: `RECAPAlgorithm` will own the value function. On RECAP rollouts, great point. RECAP rollouts are indeed separate from the RL actor/learner loop — they are standard VLA inference sessions with data recording, binary success/fail labels, and optional per-step intervention labels; no RL environment is involved. I'll add a dedicated section to the roadmap covering the rollout collection script, how intervention labels are recorded, and the reward labeling step (binary success → per-step rewards, MC returns). That should clarify the full pipeline from data collection to training. One thing I'm still working through is where the RECAP training loop itself should live: in `lerobot-train`, in the learner, or in a dedicated script. I haven't settled on the right answer yet and would welcome your thoughts if you have a preference. I fully agree on the docs. I'll add an example config for the Franka lift-cube task, provide pretrained weights, and document what to expect during training. That should lower the barrier significantly for new users and contributors.
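To make that reward labeling step concrete, here is a minimal sketch assuming a simple sparse-reward scheme; `label_episode`, the reward placement, and the discount value are illustrative, not RECAP's actual labeling code:

```python
# Illustrative sketch only: turn a binary episode success label into sparse per-step
# rewards and discounted Monte Carlo returns.
def label_episode(num_steps: int, success: bool, gamma: float = 0.99) -> tuple[list[float], list[float]]:
    rewards = [0.0] * num_steps
    if success:
        rewards[-1] = 1.0  # sparse reward: 1.0 on the final step of a successful episode

    # Discounted Monte Carlo return at each step: G_t = r_t + gamma * G_{t+1}
    returns = [0.0] * num_steps
    running = 0.0
    for t in reversed(range(num_steps)):
        running = rewards[t] + gamma * running
        returns[t] = running
    return rewards, returns
```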
```diff
 from .trainer import RLTrainer as RLTrainer

-__all__: list[str] = []
+__all__ = [
```
interesting that we don't add any require_package guard in here. The hilserl tag in pyproject includes:

`hilserl = ["lerobot[transformers-dep]", "gym-hil>=0.1.13,<0.2.0", "lerobot[grpcio-dep]", "lerobot[placo-dep]"]`

- Evaluate if this is still the case, maybe we can remove some
- for those that we need, check if we add the guard in the specific file or in the `__init__` if the entire module doesn't make sense without that dependency

You can check how we do this in the dataset or policy module
Good catch. I dropped the `__init__.py` guard because `lerobot.rl` now also exports algorithm-side primitives (`RLAlgorithm`, `RLAlgorithmConfig`, `RLTrainer`, `ReplayBuffer`, `DataMixer`) that don't need gRPC. Instead I've added file-level `require_package` guards: `actor.py` / `learner.py` / `learner_service.py` for grpcio, and `gym_manipulator.py`.

Fixed it in 0a470b0
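For reference, a minimal sketch of what such a file-level guard might look like (the package argument and extra label here are assumptions based on the pyproject extra quoted above, not the exact call in the PR):

```python
# Minimal sketch of a module-level guard, e.g. at the top of rl/actor.py; package name
# and extra label are assumptions for illustration.
from lerobot.utils.import_utils import require_package

require_package("grpcio", extra="hilserl")  # fail fast with an actionable install hint
```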
The current implementation is fine, but ideally we would like to have as few exceptions in the pyproject.toml as possible. I wonder, wasn't it possible to add the guard in the constructors of the classes that require it? As it currently stands it is equivalent to having the guard in the `__init__.py` of the rl module, because the symbols are being exported. What we do in policies that require specific dependencies is something like:

```python
# top of the file, no lint exception needed
...
if TYPE_CHECKING or _transformers_available:
    from transformers import CLIPTextModel, CLIPVisionModel
else:
    CLIPTextModel = None
    CLIPVisionModel = None
...

# Then we add the guard at runtime when the class is initialized
class MultiTaskDiTPolicy(PreTrainedPolicy):
    config_class = MultiTaskDiTConfig
    name = "multi_task_dit"

    def __init__(self, config: MultiTaskDiTConfig, **kwargs):
        require_package("transformers", extra="multi_task_dit")
        require_package("diffusers", extra="multi_task_dit")
```

I understand if this is not possible in this case, but from a quick look I think it would work here too.
```python
# Check if this is a GymHIL simulation environment
if cfg.name == "gym_hil":
    assert cfg.robot is None and cfg.teleop is None, "GymHIL environment does not support robot or teleop"
    from lerobot.utils.import_utils import require_package
```
prioritize imports at the top level
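A sketch of what that suggestion would look like (the surrounding function name and the specific `require_package` call are assumptions for illustration):

```python
# Hoist the import to module scope instead of importing inside the gym_hil branch.
from lerobot.utils.import_utils import require_package


def make_env(cfg):  # hypothetical surrounding function
    # ...
    if cfg.name == "gym_hil":
        assert cfg.robot is None and cfg.teleop is None, "GymHIL environment does not support robot or teleop"
        require_package("gym_hil", extra="hilserl")  # assumed package/extra names
        # ...
```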
refactor(rl): Introduce a modular RL stack to support future RL algorithms and VLA fine-tuning with RL
Type / Scope

`src/lerobot/rl/`, `src/lerobot/configs/`, `src/lerobot/processor/`

Summary / Motivation
HIL-SERL is working (#504, PR #644), but training logic was coupled to the learner script. This refactor introduces `RLAlgorithm`, `RLTrainer`, and `DataMixer` so that adding a new algorithm = one file + config, the same pattern as policies. No breaking changes to CLI or config. Phase 1 of a broader plan.

Related issues
What changed
- `src/lerobot/rl/algorithms/base.py` (NEW): `RLAlgorithm` ABC, `RLAlgorithmConfig` draccus registry, `TrainingStats` dataclass.
- `src/lerobot/rl/algorithms/sac.py` (NEW): SAC logic moved from the learner; batch iterator pattern.
- `src/lerobot/rl/algorithms/__init__.py` (NEW): Registry, `make_algorithm()` factory.
- `src/lerobot/rl/trainer.py` (NEW): `RLTrainer` — preprocesses batches, delegates to `algorithm.update()`.
- `src/lerobot/rl/data_sources/data_mixer.py` (NEW): `DataMixer` ABC, `OnlineOfflineMixer`.
- `src/lerobot/rl/data_sources/__init__.py` (NEW): Exports.
- `src/lerobot/rl/learner.py` (MODIFIED): Delegates to `algorithm.update()`.
- `src/lerobot/rl/actor.py` (MODIFIED): Uses `RLAlgorithm.select_action()`, `load_weights()`.
- `src/lerobot/configs/train.py` (MODIFIED): `RLAlgorithmConfig` in `TrainRLServerPipelineConfig`.
- `src/lerobot/processor/normalize_processor.py` (MODIFIED): Pre/post processors for observation normalization; added `_reshape_visual_stats()`.
- `tests/rl/test_sac_algorithm.py` (NEW): SAC algorithm tests.
- `tests/rl/test_trainer.py` (NEW): RLTrainer tests.
- `tests/rl/test_data_mixer.py` (NEW): DataMixer tests.
- `tests/rl/test_actor_learner.py` (MODIFIED): Integration tests for the new abstractions.
- `policies/sac/` → `policies/gaussian_actor/` (new): The policy-side directory was renamed for clarity now that RL concerns live in `rl/algorithms/sac/`. The policy module only owns the actor (a tanh-squashed Gaussian) plus its observation encoder.
- `SACConfig` → `GaussianActorConfig`, `SACPolicy` → `GaussianActorPolicy`, `SACObservationEncoder` → `GaussianActorObservationEncoder`.
- `make_sac_pre_post_processors` → `make_gaussian_actor_pre_post_processors`.
- `@PreTrainedConfig.register_subclass("sac")` → `@PreTrainedConfig.register_subclass("gaussian_actor")`, and `name = "gaussian_actor"` on the policy class.
- The algorithm-side registry keeps the "sac" name (`@RLAlgorithmConfig.register_subclass("sac")`, `SACAlgorithm.name = "sac"`) — only the policy-side registry moved.
- References updated across `policies/__init__.py`, `policies/factory.py`, `rl/algorithms/sac/{configuration_sac,sac_algorithm}.py`, `processor/hil_processor.py`, `docs/source/hilserl.mdx`, `templates/lerobot_modelcard_template.md`, `examples/tutorial/rl/hilserl_example.py`, and all `tests/`.
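To illustrate the intended extension pattern ("one file + config"), a hedged sketch of what registering a hypothetical new algorithm might look like; the registration decorator and `update(batch_iterator)` come from the list above, while the class names, fields, and everything else are assumptions:

```python
from dataclasses import dataclass

from lerobot.rl.algorithms.base import RLAlgorithm, RLAlgorithmConfig


@RLAlgorithmConfig.register_subclass("my_algo")  # draccus registry, mirroring "sac"
@dataclass
class MyAlgoConfig(RLAlgorithmConfig):
    learning_rate: float = 3e-4


class MyAlgo(RLAlgorithm):
    name = "my_algo"

    def update(self, batch_iterator):
        # The algorithm owns its gradient-step loop: one step per batch the trainer yields,
        # returning a TrainingStats instance with logged metrics (omitted in this sketch).
        for batch in batch_iterator:
            ...  # compute losses, step optimizers

    def select_action(self, observation):
        ...  # delegate to the policy/actor network
```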
Bug fixes included

This PR also supersedes several open RL bug fix PRs:

- … so `get_teleop_events()` could read it.
- … `normalize_inputs`/`normalize_targets` calls; added `**kwargs` to `Classifier.__init__()` for compatibility with `make_policy`.
- Added `try`/`finally` in `control_loop` to flush the image writer on any exit.
- Added `try`/`except`/`finally` to ensure queue close and process join on exceptions.
- … `replay_buffer.add()`.
- Added a `threading.Lock` to `add()`/`sample()` to prevent corrupted transitions from async prefetch.
- … `{IS_INTERVENTION: False}` from `RobotEnv` so teleop intervention is correctly detected.
- … `is_intervention` in `complementary_info` sent to the learner so intervention transitions populate the offline buffer.
- Added `_reshape_visual_stats()` to `NormalizerProcessorStep` to reshape image stats from `(C,)` to `(C, 1, 1)` when provided manually via JSON config (e.g. `dataset_stats` in RL training configs). Without this, normalizing `(B, C, H, W)` image tensors with flat `(C,)` stats fails due to incorrect broadcasting. Stats from `compute_stats` (imitation learning) are already `(C, 1, 1)` and are unaffected.
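A simplified sketch of that broadcasting issue and the reshape fix (illustrative only; `reshape_visual_stats` below is not the actual `NormalizerProcessorStep` code):

```python
import torch

def reshape_visual_stats(stats: dict[str, torch.Tensor]) -> dict[str, torch.Tensor]:
    # Flat (C,) image stats are reshaped to (C, 1, 1) so they broadcast against (B, C, H, W).
    return {k: v.reshape(-1, 1, 1) if v.ndim == 1 else v for k, v in stats.items()}

batch = torch.rand(8, 3, 128, 128)                   # (B, C, H, W) image batch
mean = torch.tensor([0.485, 0.456, 0.406])           # (C,) stats as given in a JSON config
mean = reshape_visual_stats({"mean": mean})["mean"]  # -> (3, 1, 1)
normalized = batch - mean                            # now broadcasts correctly over H and W
```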
HIL-SERL pipeline fixes (new)

Bugs surfaced while validating end-to-end HIL-SERL training on SO100 and gym-hil:

- (`356a64d8`) `discrete_penalty` was never propagated from `info` to `complementary_data` in `GymHILAdapterProcessorStep`, so the gripper penalty term never reached the reward signal. Also reworked `GripperPenaltyProcessorStep` to fire only on real transitions (`is_open & cmd_close` or `is_closed & cmd_open`) with configurable thresholds (`open_threshold=0.1`, `closed_threshold=0.9`), and bumped the default penalty from `-0.01` to `-0.02`. Plus plumbed `raw_joint_positions` from `env.get_raw_joint_positions()` through `complementary_data` (needed by SO100 for kinematics).
- (`a64e6f50`) Discrete gripper command `{0, 1, 2}` was mapped without the SO100 sign flip (joint position increases on close), so close↔open commands were swapped. Standardized the mapping to `{0=close, 1=stay, 2=open}` to match `GamepadTeleop.GripperAction`.
- (`eda47eca`) With the new mapping, the teleop "stay" action is `1` (was `0`, which is now "close" — i.e. the policy-paused robot was being commanded to keep closing the gripper).
- (`77d18659`) Extracted `reset_and_build_transition(env, env_processor, action_processor)` so the actor and the gym_manipulator control loop share one reset path and can't drift.
- (`d4a568ee`) Fixed the raw joint position key lookup from `GRIPPER_KEY` to `GRIPPER_KEY.pos` in `GripperPenaltyProcessorStep`. Also added an early-return guard when `action is None` (which happens on the first call after reset, before any action has been applied), preventing a crash during episode initialization.
- (`0d60a855`) The dataset was recording the post-step observation together with the action and reward, breaking the standard `(obs, action, next_reward)` RL tuple. Fixed by capturing the observation before calling `step_env_and_process_transition`, so each recorded frame correctly stores the observation that the agent saw when it selected the action.
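For context on the reworked penalty, a hedged sketch of the transition-only condition (the function, argument names, and the open/closed position convention are assumptions; this is not the actual `GripperPenaltyProcessorStep` code):

```python
def gripper_penalty(
    gripper_pos: float,              # normalized joint position, assumed 0.0 = open, 1.0 = closed
    gripper_cmd: int,                # discrete command: 0 = close, 1 = stay, 2 = open
    open_threshold: float = 0.1,
    closed_threshold: float = 0.9,
    penalty: float = -0.02,
) -> float:
    is_open = gripper_pos <= open_threshold
    is_closed = gripper_pos >= closed_threshold
    cmd_close = gripper_cmd == 0
    cmd_open = gripper_cmd == 2
    # Fire only on "real" transitions: commanding close while open, or open while closed.
    if (is_open and cmd_close) or (is_closed and cmd_open):
        return penalty
    return 0.0
```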
External dependency updates

helper2424/resnet10 encoder (used by SAC/HIL-SERL for vision encoding): Updated to work with Transformers v5 and Hugging Face Hub v1. Added `self.post_init()` for proper v5 weight initialization and model registration. Pinned `jax>=0.4.30,<0.5.0` to fix pickle deserialization of legacy SERL weights (`named_shape` removed in JAX 0.5+). Added `requests` as an explicit dependency (no longer transitive since `huggingface-hub` v1 switched to `httpx`). See helper2424/resnet10#7 and helper2424/resnet10#8.
"sac"→"gaussian_actor". Existing JSON configs must update"policy.type": "sac"to"policy.type": "gaussian_actor", and existing Hub checkpoints with"type": "sac"inconfig.jsonmust migrate.How was this tested
How was this tested

- `pytest -q tests/rl/`
- End-to-end HIL-SERL training on gym-hil (`PandaPickCubeGamepad-v0`) — learns a successful pick-cube policy in ~15 minutes of training.
- `pytest tests/rl/ tests/processor/ tests/policies/` passes (432 processor, 68 RL, all gaussian_actor/classifier tests green; the 3 pre-existing failures in `test_sac_algorithm_target_entropy*` and `test_gaussian_actor_policy_save_and_load_with_discrete_critic` are unrelated to the rename).

How to run locally (reviewer)
Checklist (required before merge)
- Pre-commit hooks pass (`pre-commit run -a`)
- Tests pass (`pytest`)

Reviewer notes
- `RLAlgorithm` in `algorithms/base.py` — this is the contract that every future RL algorithm will implement. Feedback on the interface design is especially welcome.
- The `update(batch_iterator)` pattern is intentional: algorithms own the gradient-step loop (including UTD ratio), while the trainer owns data mixing and preprocessing.
- `safe_stop_image_writer` now catches `BaseException` — this also benefits `lerobot_record.py` and `hil_data_collection.py`, which use the same decorator.
- The `move_transition_to_device` removal is safe because `ReplayBuffer.add()` uses `.copy_()`, which handles cross-device transfers, and `ReplayBuffer.sample()` already moves batches to the training device.
- `_reshape_visual_stats()` only reshapes `ndim == 1` tensors for VISUAL features — stats already shaped as `(C, 1, 1)` (from `compute_stats`) are unaffected, so this doesn't change imitation learning behavior.
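As a closing illustration of the first two notes, a sketch of how the responsibilities might divide; only `update(batch_iterator)` and the "algorithm owns UTD, trainer owns mixing/preprocessing" split come from this PR, while the helper names (`preprocess`, `sample`) and the `SketchAlgorithm` class are assumptions:

```python
from typing import Iterator


def make_batch_iterator(trainer, mixer) -> Iterator[dict]:
    # Trainer side: owns data mixing and preprocessing, yields ready-to-train batches.
    while True:
        yield trainer.preprocess(mixer.sample())


class SketchAlgorithm:
    utd_ratio = 2

    def update(self, batch_iterator: Iterator[dict]):
        # Algorithm side: owns the gradient-step loop, including how many batches (UTD) to draw.
        for _ in range(self.utd_ratio):
            batch = next(batch_iterator)
            ...  # compute losses and step optimizers on this batch
```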