Question about reproducing DROID AOC-F1/VOC-F1: camera view and evaluation setup #10

@bianhua-12

Hi VLAC team,

Thanks for releasing the model and example code. We are trying to reproduce the DROID AOC-F1 / VOC-F1 style evaluation reported in the paper, but our numbers are much lower than expected, so we suspect we may be missing an important setup detail such as the camera view, reference trajectory setting, or filtering.

What we tested

We are specifically trying to reproduce the zero-shot DROID AOC-F1 / VOC-F1 performance of the released checkpoint. We did not fine-tune the model or train any adapter. We used the released Hugging Face model InternRobotics/VLAC and evaluated on DROID trajectories with the VLAC trajectory-critic logic, following the public example/code as closely as possible:

  • Single main_video_path style input, no multi-view fusion.
  • No reference video: reference_video_path=None / no ref_image_list.
  • think=False, rich=False, done_flag=False.
  • Official score prompt format from get_score_prompt(..., trajectory_len=0):
    • Image-1: <image> / Image-2: <image>
    • Response the relative progressing of target task follow <score>... <task> ... </task> <score>
  • Official system prompt from set_system_prompt().
  • Image preprocessing matched the released helper behavior: resize each image to 448x448 and use one image patch.
  • DROID videos are 15 fps. We used frame_gap=15, which should be equivalent to the README example compress_video(..., fps=5) followed by skip=5, i.e. roughly one critic step per second.
  • For reverse_eval, we reverse the image pair and then negate the predicted critic, matching get_trajectory_critic(..., reverse_eval=True).
  • We compute value_list using critic_to_value_simple from the public code.
  • VOC/VROC are computed as spearmanr(value_list, np.arange(len(value_list))).
  • VOC-F1 / AOC-F1 is computed as 0 if VOC * VROC < 0 else 2 * VOC * VROC / (VOC + VROC).
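
For concreteness, the last two bullets can be sketched as below. This is a minimal sketch of the metric as described in this issue, not the official VLAC evaluation code: the helper names `voc` and `voc_f1` are illustrative, and the guard against a zero denominator is an added assumption that the formula above does not mention. The same `voc` function is applied to the forward and the reverse-direction value lists, since the reverse critic is negated before the values are accumulated.

```python
import numpy as np
from scipy.stats import spearmanr


def voc(value_list):
    """Spearman rank correlation between predicted values and time order."""
    rho, _ = spearmanr(value_list, np.arange(len(value_list)))
    return float(rho)


def voc_f1(voc_fwd, vroc):
    """F1 of forward VOC and reverse VROC; 0 when their signs disagree."""
    # The zero-sum check is an added safeguard (assumption), not part of
    # the formula quoted in the issue.
    if voc_fwd * vroc < 0 or voc_fwd + vroc == 0:
        return 0.0
    return 2 * voc_fwd * vroc / (voc_fwd + vroc)


# Toy value list (illustrative only): one out-of-order step lowers VOC below 1.
toy_values = [0.1, 0.3, 0.2, 0.9]
print(voc(toy_values))
```

Note that `spearmanr` returns NaN for a constant value list, so trajectories where every predicted critic is zero would need separate handling before averaging.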

We evaluated 1000 DROID test episodes, with 17,614 trajectory steps in each direction.

Results we got

Using observation.images.exterior_image_1_left:

VOC     = 0.3324
VROC    = 0.8843
VOC-F1  = 0.4832

Using observation.images.exterior_image_2_left:

VOC     = 0.3314
VROC    = 0.9005
VOC-F1  = 0.4845

These are far below the DROID number we understood from the paper for the 2B model, around 0.79 AOC-F1 / VOC-F1. The main issue seems to be low forward VOC; VROC is high.
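
As a quick sanity check on these numbers, the sketch below recomputes F1 from the reported VOC and VROC and inverts the formula to estimate how high forward VOC would have to be to reach roughly 0.79 with VROC unchanged. It assumes the paper uses the same F1 formula as described in this issue and that F1 is taken over the aggregate values; `required_voc` is an illustrative helper, not part of the VLAC code.

```python
def voc_f1(voc, vroc):
    """F1 as described in this issue: 0 on sign disagreement, else harmonic mean."""
    if voc * vroc < 0 or voc + vroc == 0:
        return 0.0
    return 2 * voc * vroc / (voc + vroc)


def required_voc(target_f1, vroc):
    """Invert F1 = 2*VOC*VROC / (VOC + VROC) for VOC, holding VROC fixed.

    Only meaningful when 2 * vroc > target_f1.
    """
    return target_f1 * vroc / (2 * vroc - target_f1)


# Reported (VOC, VROC, VOC-F1) per camera view, copied from the results above.
reported = {
    "exterior_image_1_left": (0.3324, 0.8843, 0.4832),
    "exterior_image_2_left": (0.3314, 0.9005, 0.4845),
}

for view, (voc, vroc, f1) in reported.items():
    recomputed = voc_f1(voc, vroc)
    needed = required_voc(0.79, vroc)  # 0.79 is the approximate paper number
    print(f"{view}: F1 recomputed={recomputed:.4f} (reported {f1}), "
          f"VOC needed for 0.79 = {needed:.3f}")
```

Under these assumptions, forward VOC would need to be about 0.70 to 0.71 rather than the 0.33 we measure, which is why we suspect a setup difference rather than noise.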

Questions

Could you clarify the exact DROID evaluation setup used for the reported AOC-F1/VOC-F1 numbers?

  1. Which DROID camera view did you use? For example:
    • observation.images.exterior_image_1_left
    • observation.images.exterior_image_2_left
    • observation.images.wrist_image_left
    • or some selected/rendered view outside these LeRobot field names?
  2. Were the DROID AOC-F1/VOC-F1 results computed with a reference trajectory / one-shot context, or with only task text + main video?
  3. Was the DROID evaluation split filtered to successful/clean trajectories, or are all episodes included?
  4. What exact fps, skip, and frame_skip settings were used for DROID?
  5. Is AOC-F1 in the table computed from dataset-level mean AOC/VROC, or averaged per trajectory after computing per-trajectory F1?
  6. Is there an official script/config for reproducing the DROID AOC-F1/VOC-F1 evaluation?
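
To illustrate why question 5 matters, the two aggregation orders can give noticeably different numbers. The sketch below uses two hypothetical per-trajectory (VOC, VROC) pairs that are made up for illustration and are not measurements; the F1 formula is the one described earlier in this issue.

```python
from statistics import mean


def voc_f1(voc, vroc):
    """F1 as described in this issue: 0 on sign disagreement, else harmonic mean."""
    if voc * vroc < 0 or voc + vroc == 0:
        return 0.0
    return 2 * voc * vroc / (voc + vroc)


# Hypothetical per-trajectory (VOC, VROC) pairs, for illustration only.
per_traj = [(0.9, 0.9), (0.1, 0.9)]

# Option A: average VOC and VROC over the dataset, then take one F1.
f1_of_means = voc_f1(mean(v for v, _ in per_traj),
                     mean(r for _, r in per_traj))

# Option B: take F1 per trajectory, then average the F1 values.
mean_of_f1s = mean(voc_f1(v, r) for v, r in per_traj)

print(f"F1 of dataset-level means: {f1_of_means:.3f}")
print(f"Mean of per-trajectory F1: {mean_of_f1s:.3f}")
```

With these toy values the first option gives roughly 0.64 and the second 0.54, so knowing which one the table uses would help us compare like with like.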

We may simply be using the wrong view or missing a filtering/reference-context detail. Any clarification would be very helpful. Thanks!
