Hi VLAC team,
Thanks for releasing the model and example code. We are trying to reproduce the DROID AOC-F1 / VOC-F1 style evaluation reported in the paper, but our numbers are much lower than expected, so we suspect we may be missing an important setup detail such as the camera view, reference trajectory setting, or filtering.
What we tested
We are specifically trying to reproduce the zero-shot DROID AOC-F1 / VOC-F1 performance of the released checkpoint. We did not fine-tune the model or train any adapter. We used the released Hugging Face model InternRobotics/VLAC and evaluated on DROID trajectories with the VLAC trajectory-critic logic, following the public example/code as closely as possible:
- Single main_video_path style input, no multi-view fusion.
- No reference video: reference_video_path=None / no ref_image_list.
- think=False, rich=False, done_flag=False.
- Official score prompt format from get_score_prompt(..., trajectory_len=0):
  Image-1: <image> / Image-2: <image>
  Response the relative progressing of target task follow <score>... <task> ... </task> <score>
- Official system prompt from set_system_prompt().
- Image preprocessing matched the released helper behavior: resize each image to 448x448 and use one image patch.
- DROID videos are 15 fps. We used frame_gap=15, which should be equivalent to the README example compress_video(..., fps=5) followed by skip=5, i.e. roughly one critic step per second.
- For reverse_eval, we reverse the image pair and then negate the predicted critic, matching get_trajectory_critic(..., reverse_eval=True).
- We compute value_list using critic_to_value_simple from the public code.
- VOC/VROC are computed as spearmanr(value_list, np.arange(len(value_list))).
- VOC-F1 / AOC-F1 is computed as 0 if VOC * VROC < 0 else 2 * VOC * VROC / (VOC + VROC).
We evaluated 1000 DROID test episodes, with 17,614 trajectory steps in each direction.
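
For reference, the metric step of our pipeline can be sketched as follows. This is a minimal sketch of what we run, not the official evaluation code: the helper names voc and voc_f1 are ours, the guard against a zero denominator is our addition, and value_list is assumed to be the per-step output of critic_to_value_simple.

```python
import numpy as np
from scipy.stats import spearmanr


def voc(value_list):
    """Spearman rank correlation between predicted values and step order."""
    rho, _ = spearmanr(value_list, np.arange(len(value_list)))
    return float(rho)


def voc_f1(voc_fwd, voc_rev):
    """F1-style combination of forward VOC and reverse VROC.

    Returns 0 when the two correlations have opposite signs, as in our
    setup. The zero-sum check is our own guard against division by zero.
    """
    if voc_fwd * voc_rev < 0 or voc_fwd + voc_rev == 0:
        return 0.0
    return 2 * voc_fwd * voc_rev / (voc_fwd + voc_rev)


# Toy value lists (made up for illustration, not model output).
increasing = [0.1, 0.2, 0.4, 0.8]
decreasing = [0.8, 0.4, 0.2, 0.1]
print(voc(increasing), voc(decreasing))
```

A strictly increasing value_list gives a correlation of 1 and a strictly decreasing one gives -1, so a low forward VOC means the predicted values are often not monotone in time.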
Results we got
Using observation.images.exterior_image_1_left:
VOC = 0.3324
VROC = 0.8843
VOC-F1 = 0.4832
Using observation.images.exterior_image_2_left:
VOC = 0.3314
VROC = 0.9005
VOC-F1 = 0.4845
These are far below the DROID number we understood from the paper for the 2B model, around 0.79 AOC-F1 / VOC-F1. The main issue seems to be low forward VOC; VROC is high.
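
As a sanity check on the arithmetic, the sketch below applies our F1 formula to the aggregate VOC and VROC values listed above; it reproduces the reported 0.4832 and 0.4845. It also shows, with made-up per-episode numbers, that computing F1 from dataset-level means and averaging per-episode F1 can give different results. The function name voc_f1, the zero-sum guard, and the episode values are ours and purely illustrative.

```python
def voc_f1(voc, vroc):
    """F1-style combination used in our evaluation; 0 when signs disagree."""
    # The zero-sum check is our addition to avoid dividing by zero.
    if voc * vroc < 0 or voc + vroc == 0:
        return 0.0
    return 2 * voc * vroc / (voc + vroc)


# Aggregate numbers from the two exterior views reported above.
f1_view1 = voc_f1(0.3324, 0.8843)
f1_view2 = voc_f1(0.3314, 0.9005)
print(round(f1_view1, 4), round(f1_view2, 4))

# Hypothetical per-episode (VOC, VROC) pairs, made up for illustration only.
episodes = [(0.9, 0.9), (-0.6, 0.9), (0.7, 0.9)]

mean_voc = sum(v for v, _ in episodes) / len(episodes)
mean_vroc = sum(r for _, r in episodes) / len(episodes)

# Option A: F1 of the dataset-level means.
f1_of_means = voc_f1(mean_voc, mean_vroc)

# Option B: mean of the per-episode F1 values.
mean_of_f1s = sum(voc_f1(v, r) for v, r in episodes) / len(episodes)

print(f1_of_means, mean_of_f1s)
```

Because the two aggregation orders can disagree, knowing which one the paper uses matters when comparing against the reported table.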
Questions
Could you clarify the exact DROID evaluation setup used for the reported AOC-F1/VOC-F1 numbers?
- Which DROID camera view did you use? For example:
  - observation.images.exterior_image_1_left
  - observation.images.exterior_image_2_left
  - observation.images.wrist_image_left
  - or some selected/rendered view outside these LeRobot field names?
- Were the DROID AOC-F1/VOC-F1 results computed with a reference trajectory / one-shot context, or with only task text + main video?
- Was the DROID evaluation split filtered to successful/clean trajectories, or are all episodes included?
- What exact fps, skip, and frame_skip settings were used for DROID?
- Is AOC-F1 in the table computed from dataset-level mean AOC/VROC, or averaged per trajectory after computing per-trajectory F1?
- Is there an official script/config for reproducing the DROID AOC-F1/VOC-F1 evaluation?
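
To make the fps/skip question concrete, this is the frame-index equivalence we are assuming between our frame_gap=15 sampling and the README-style compress-then-skip pipeline. It is only a sketch: both helper functions are hypothetical, and it assumes compression keeps every third frame starting at frame 0. We have not verified how compress_video actually resamples, so the real pipeline may select different frames.

```python
def indices_frame_gap(n_frames, frame_gap):
    """Source-frame indices when sampling every frame_gap-th frame."""
    return list(range(0, n_frames, frame_gap))


def indices_compress_then_skip(n_frames, src_fps, dst_fps, skip):
    """Source-frame indices after fps reduction followed by skipping.

    Assumption: reducing src_fps to dst_fps keeps every
    (src_fps // dst_fps)-th frame starting at frame 0. A real video
    re-encode may resample by timestamp instead.
    """
    stride = src_fps // dst_fps
    kept = list(range(0, n_frames, stride))
    return kept[::skip]


# A hypothetical 200-frame DROID episode recorded at 15 fps.
ours = indices_frame_gap(200, frame_gap=15)
readme_style = indices_compress_then_skip(200, src_fps=15, dst_fps=5, skip=5)
print(ours == readme_style)
```

If the official evaluation uses a different stride or starting offset, the image pairs seen by the critic would differ from ours even at the same nominal rate of one step per second.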
We may simply be using the wrong view or missing a filtering/reference-context detail. Any clarification would be very helpful. Thanks!