Question about reproducing DROID AOC-F1/VOC-F1: camera view and evaluation setup

Hi VLAC team,

Thanks for releasing the model and example code. We are trying to reproduce the DROID AOC-F1 / VOC-F1 style evaluation reported in the paper, but our numbers are much lower than expected, so we suspect we may be missing an important setup detail such as the camera view, reference trajectory setting, or filtering.

## What we tested

We are specifically trying to reproduce the **zero-shot** DROID AOC-F1 / VOC-F1 performance of the released checkpoint. We did not fine-tune the model or train any adapter. We used the released Hugging Face model `InternRobotics/VLAC` and evaluated on DROID trajectories with the VLAC trajectory-critic logic, following the public example/code as closely as possible:

- Single `main_video_path` style input, no multi-view fusion.
- No reference video: `reference_video_path=None` / no `ref_image_list`.
- `think=False`, `rich=False`, `done_flag=False`.
- Official score prompt format from `get_score_prompt(..., trajectory_len=0)`:
  - `Image-1: <image>` / `Image-2: <image>`
  - `Response the relative progressing of target task follow <score>... <task> ... </task> <score>`
- Official system prompt from `set_system_prompt()`.
- Image preprocessing matched the released helper behavior: resize each image to `448x448` and use one image patch.
- DROID videos are 15 fps. We used `frame_gap=15`, which should be equivalent to the README example `compress_video(..., fps=5)` followed by `skip=5`, i.e. roughly one critic step per second.
- For `reverse_eval`, we reverse the image pair and then negate the predicted critic, matching `get_trajectory_critic(..., reverse_eval=True)`.
- We compute `value_list` using `critic_to_value_simple` from the public code.
- VOC/VROC are computed as `spearmanr(value_list, np.arange(len(value_list)))`.
- VOC-F1 / AOC-F1 is computed as `0 if VOC * VROC < 0 else 2 * VOC * VROC / (VOC + VROC)`.
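
The frame-sampling equivalence assumed in the list above can be checked with a small index sketch. It assumes that `compress_video(..., fps=5)` keeps every third frame of a 15 fps video starting at frame 0, which we have not verified against the actual implementation. Both helper functions are hypothetical and not part of the VLAC code.

```python
def direct_indices(n_frames, frame_gap=15):
    """Raw-frame indices kept when taking every `frame_gap`-th frame."""
    return list(range(0, n_frames, frame_gap))


def two_stage_indices(n_frames, src_fps=15, dst_fps=5, skip=5):
    """Raw-frame indices kept by downsampling first, then skipping.

    Assumption: the downsampling step keeps every (src_fps // dst_fps)-th
    frame starting at index 0. The real compress_video may behave differently.
    """
    stride = src_fps // dst_fps               # 15 fps -> 5 fps keeps every 3rd frame
    compressed = list(range(0, n_frames, stride))
    return compressed[::skip]                 # then keep every 5th compressed frame


# For a 100-frame clip both schemes select one frame per second of video.
print(direct_indices(100))
print(two_stage_indices(100))
```

If `compress_video` resamples by timestamp or re-encodes the video, the selected frames could differ slightly from this index arithmetic.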

We evaluated 1000 DROID test episodes, which gave 17,614 trajectory steps in total for each direction (forward and reverse).

## Results we got

Using `observation.images.exterior_image_1_left`:

```text
VOC     = 0.3324
VROC    = 0.8843
VOC-F1  = 0.4832
```

Using `observation.images.exterior_image_2_left`:

```text
VOC     = 0.3314
VROC    = 0.9005
VOC-F1  = 0.4845
```

These are far below the roughly `0.79` AOC-F1 / VOC-F1 that we understood the paper to report for the 2B model on DROID. The gap comes almost entirely from the forward direction: VOC is low (about 0.33) while VROC is high (0.88 to 0.90).
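
As a sanity check, the F1 values above can be recomputed from the reported VOC and VROC using the formula from "What we tested". This is a minimal sketch: the function name `voc_f1` is ours, and the zero-denominator guard is an addition that is not part of the formula as we wrote it above.

```python
def voc_f1(voc, vroc):
    """F1-style combination of forward (VOC) and reverse (VROC) correlation."""
    if voc * vroc < 0:
        return 0.0
    if voc + vroc == 0:
        # Guard added for this sketch only; avoids division by zero.
        return 0.0
    return 2 * voc * vroc / (voc + vroc)


# (VOC, VROC, reported F1) for the two exterior views listed above.
reported = {
    "exterior_image_1_left": (0.3324, 0.8843, 0.4832),
    "exterior_image_2_left": (0.3314, 0.9005, 0.4845),
}

for view, (voc, vroc, f1) in reported.items():
    print(view, round(voc_f1(voc, vroc), 4), f1)
```

The recomputed values agree with the reported ones to four decimals, which is consistent with our F1 being derived from the aggregate VOC and VROC.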

## Questions

Could you clarify the exact DROID evaluation setup used for the reported AOC-F1/VOC-F1 numbers?

1. Which DROID camera view did you use? For example:
   - `observation.images.exterior_image_1_left`
   - `observation.images.exterior_image_2_left`
   - `observation.images.wrist_image_left`
   - or some selected/rendered view outside these LeRobot field names?
2. Were the DROID AOC-F1/VOC-F1 results computed with a reference trajectory / one-shot context, or with only task text + main video?
3. Was the DROID evaluation split filtered to successful/clean trajectories, or are all episodes included?
4. What exact `fps`, `skip`, and `frame_skip` settings were used for DROID?
5. Is AOC-F1 in the table computed from the dataset-level mean VOC and VROC, or is F1 computed per trajectory and then averaged?
6. Is there an official script/config for reproducing the DROID AOC-F1/VOC-F1 evaluation?
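
To illustrate why question 5 matters, here is a small sketch of the two aggregation orders. The per-trajectory `(VOC, VROC)` pairs are made up for illustration and are not from our evaluation; `voc_f1`, `f1_of_means`, and `mean_of_f1s` are hypothetical helper names, and the zero-denominator guard is our addition.

```python
def voc_f1(voc, vroc):
    """F1-style combination of VOC and VROC, zero when the signs disagree."""
    if voc * vroc < 0 or voc + vroc == 0:  # second condition is a guard added here
        return 0.0
    return 2 * voc * vroc / (voc + vroc)


def f1_of_means(pairs):
    """Average VOC and VROC over trajectories first, then compute one F1."""
    mean_voc = sum(v for v, _ in pairs) / len(pairs)
    mean_vroc = sum(r for _, r in pairs) / len(pairs)
    return voc_f1(mean_voc, mean_vroc)


def mean_of_f1s(pairs):
    """Compute F1 per trajectory, then average the F1 values."""
    return sum(voc_f1(v, r) for v, r in pairs) / len(pairs)


# Made-up per-trajectory (VOC, VROC) values, for illustration only.
pairs = [(0.9, 0.9), (-0.2, 0.8), (0.3, 0.95)]

print("F1 of means:", round(f1_of_means(pairs), 4))
print("Mean of F1s:", round(mean_of_f1s(pairs), 4))
```

The two orders generally give different numbers, in particular when some trajectories have a negative VOC, so knowing which one the table uses would help us compare like with like.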

We may simply be using the wrong view or missing a filtering/reference-context detail. Any clarification would be very helpful. Thanks!

