Make corpus_chunk_size tunable for retrieval tasks #4450

@minhnguyent546

Description of the feature

Problem

In MTEB v1, corpus_chunk_size could be tuned via the public evaluation API (e.g. evaluation.run(..., corpus_chunk_size=500)), which was useful to reduce memory usage on large retrieval corpora.

In the current v2 codebase, the SearchEncoderWrapper class accepts corpus_chunk_size as a constructor parameter with a default of 50,000, but the public evaluation API never passes a value for it. As a result, there is no way to change the chunk size when running through the new evaluation API, mteb.evaluate(...).

The same limitation applies to bitext mining tasks (BitextMiningEvaluator).
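
To illustrate why the chunk size matters for memory, here is a minimal sketch of chunked top-k retrieval. It is not MTEB's implementation: `iter_chunks`, `chunked_top_k`, and `toy_encode` are hypothetical names made up for this example, and the "encoder" is a stand-in that maps each text to a two-dimensional vector. The point is only that one chunk of corpus embeddings is held at a time, so a smaller corpus_chunk_size lowers peak memory without changing the ranking.

```python
import heapq
from typing import Callable, Iterator, List, Sequence, Tuple

Vector = List[float]


def iter_chunks(items: Sequence[str], chunk_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most chunk_size items."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(items), chunk_size):
        yield items[start:start + chunk_size]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product, used here as the similarity score."""
    return sum(x * y for x, y in zip(a, b))


def chunked_top_k(
    query_vec: Vector,
    corpus: Sequence[str],
    encode: Callable[[Sequence[str]], List[Vector]],
    top_k: int,
    corpus_chunk_size: int,
) -> List[Tuple[float, int]]:
    """Return the top_k (score, doc_index) pairs, best first.

    The corpus is encoded chunk by chunk, so only one chunk of
    embeddings needs to be in memory at any time.
    """
    best: List[Tuple[float, int]] = []  # min-heap holding at most top_k items
    offset = 0
    for chunk in iter_chunks(corpus, corpus_chunk_size):
        chunk_vecs = encode(chunk)  # embeddings for this chunk only
        for i, vec in enumerate(chunk_vecs):
            item = (dot(query_vec, vec), offset + i)
            if len(best) < top_k:
                heapq.heappush(best, item)
            else:
                # Push the new item and drop the current worst one.
                heapq.heappushpop(best, item)
        offset += len(chunk)
    return sorted(best, reverse=True)


def toy_encode(texts: Sequence[str]) -> List[Vector]:
    """Hypothetical stand-in encoder: embeds a text as [length, 1.0]."""
    return [[float(len(t)), 1.0] for t in texts]


corpus = ["a", "bbb", "cc", "dddd", "e"]
query = [1.0, 0.0]  # with toy_encode, the score is just the text length

small_chunks = chunked_top_k(query, corpus, toy_encode, top_k=2, corpus_chunk_size=2)
one_chunk = chunked_top_k(query, corpus, toy_encode, top_k=2, corpus_chunk_size=50_000)
```

Both calls return the same ranking; only the number of embeddings held at once differs, which is why exposing the parameter is useful on large corpora.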

Feature request

Could corpus_chunk_size be made tunable through the public evaluation API again?

Labels: enhancement
