This is the repository for CalliReader: Contextualizing Chinese Calligraphy via an Embedding-Aligned Vision-Language Model.
CalliReader is a novel plug-and-play Vision-Language Model (VLM) specifically designed to interpret calligraphic artworks with diverse styles and layouts, leveraging slicing priors, embedding alignment, and effective fine-tuning. It demonstrates strong performance on Chinese calligraphy recognition and understanding, while also retaining excellent OCR ability on general scenes.
For more information, please visit our project page (Unfinished).
- 2025.10.6 We are releasing our CalliBench.
- 2025.8.10 We are releasing our training scripts.
- 2025.6.26 Our newest model is now available on HuggingFace.
- 2025.6.25 Our work has been accepted by ICCV 2025!
- 2025.2.12 The repository has been updated.
We are releasing our network and checkpoints. You can download the CalliReader weights from this HuggingFace link. The fine-tuned VLM weight files, which end with .safetensors, are stored in the folder InternVL, and all pluggable modules can be found in the folder params. Download these files and place them in the folder of the cloned repository.
You can set up the pipeline by following the steps below.
- We recommend creating a conda environment with Python>=3.9 and activating it:
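Before running inference, it can help to confirm that the downloaded files ended up where the code expects them. The following is a minimal sketch, not part of the repository: the helper name `check_weights` is hypothetical, and it assumes that the InternVL and params folders sit directly under the repository root and that the .safetensors files may be nested anywhere inside InternVL.

```python
from pathlib import Path


def check_weights(repo_root="."):
    """Return a list of problems found; an empty list means the layout looks plausible.

    Assumption: InternVL/ and params/ live directly under repo_root.
    """
    root = Path(repo_root)
    problems = []

    # Fine-tuned VLM weights: at least one .safetensors file somewhere under InternVL/.
    internvl = root / "InternVL"
    if not internvl.is_dir():
        problems.append(f"missing folder: {internvl}")
    elif not any(internvl.rglob("*.safetensors")):
        problems.append(f"no .safetensors files under {internvl}")

    # Pluggable modules: the params/ folder should exist and not be empty.
    params = root / "params"
    if not params.is_dir():
        problems.append(f"missing folder: {params}")
    elif not any(params.iterdir()):
        problems.append(f"folder is empty: {params}")

    return problems


if __name__ == "__main__":
    found = check_weights(".")
    print("layout looks fine" if not found else "\n".join(found))
```

The check only looks at folder names and file extensions; it does not verify that the weights are complete or uncorrupted.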
conda create -n callireader python=3.9
conda activate callireader
- Then, install essential dependencies:
pip install -r requirements.txt
- Finally, install the flash-attn package:
pip install flash-attn
If you run into problems with this package, you can download a .whl file here and install it directly:
pip install flash_attn-xxx.whl
Please note that this package only supports Linux systems with CUDA installed, and the flash-attn build must match the versions installed on your system. For further issues with flash-attn, please refer to its repository.
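The requirements above (Python>=3.9, Linux, CUDA, flash-attn) can be checked from inside the activated environment. This is a hedged sketch rather than a script shipped with the repository: the function name `environment_report` is hypothetical, and it assumes PyTorch is among the installed dependencies, which is used here only to ask whether CUDA is visible.

```python
import importlib.util
import platform
import sys


def environment_report():
    """Collect basic facts relevant to the installation steps above."""
    report = {
        "python_ok": sys.version_info >= (3, 9),
        "is_linux": platform.system() == "Linux",
        "torch_installed": importlib.util.find_spec("torch") is not None,
        "flash_attn_installed": importlib.util.find_spec("flash_attn") is not None,
        # Stays None when torch is absent or fails to import.
        "cuda_available": None,
    }
    if report["torch_installed"]:
        try:
            import torch  # assumption: PyTorch is one of the installed dependencies

            report["cuda_available"] = torch.cuda.is_available()
        except Exception:
            # A broken install should not crash the report.
            report["cuda_available"] = None
    return report


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```

A `False` for `flash_attn_installed` or `cuda_available` points back to the flash-attn notes above; the report does not check whether the installed versions are compatible with each other.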
We have verified that .jpg and .png format images are well supported.
- For a single image, use
python inference.py --tgt=<image path>
The result is printed directly to the terminal.
- For a folder with multiple images, use
python inference.py --tgt=<folder path> --save_name=<your save name>
Results will be saved to ./results/<your save name>.json.
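For folder inference, the two steps around the command can be sketched in Python: listing the images in a folder that use a verified format, and loading the saved results afterwards. This is an illustrative sketch under stated assumptions: `list_supported_images` and `load_results` are hypothetical helpers, extension matching is assumed to be case-insensitive, and the structure of the JSON file is not documented here, so the loader simply returns whatever the file contains.

```python
import json
from pathlib import Path

# Formats verified above; other formats may or may not work.
SUPPORTED_SUFFIXES = {".jpg", ".png"}


def list_supported_images(folder):
    """Return the .jpg/.png files directly inside folder, sorted by path."""
    folder = Path(folder)
    return sorted(
        p for p in folder.iterdir()
        if p.is_file() and p.suffix.lower() in SUPPORTED_SUFFIXES
    )


def load_results(save_name, results_dir="results"):
    """Load ./results/<save_name>.json as written by folder inference.

    The schema is not assumed; the parsed JSON value is returned as-is.
    """
    path = Path(results_dir) / f"{save_name}.json"
    with path.open(encoding="utf-8") as f:
        return json.load(f)
```

For example, `list_supported_images("./my_images")` can be used to confirm that a folder contains usable images before running the command, and `load_results("exp")` reads `./results/exp.json` afterwards (both names are placeholders).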
Data for Full-page Recognition, Region-wise OCR, Choice Questions (Author, Style, and Layout), Bilingual Interpretation, and Intent Analysis can be downloaded from this link. It contains 3,192 image-annotation samples in total, which we use to construct CalliBench.
The 7,357 samples for e-IT can be downloaded from this link.
Please refer to the train folder for further instructions.
Run evaluate.py to assess the model on CalliBench. First download the dataset, then run
python evaluate.py --type=<Eval type> --data=<CalliBench path> --save_name=<Test name>
to evaluate the model on various calligraphy-related tasks. For example, run
python evaluate.py --type=full_page --data=./Callibench --save_name=exp
to test the model on the full-page recognition task.
For the bilingual interpretation and intent analysis tasks, please refer to the train folder for example code.
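When evaluations are launched from a Python script rather than typed by hand, the evaluate.py invocation shown above can be assembled programmatically. The sketch below is an assumption-laden illustration: `build_eval_command` and `run_eval` are hypothetical helpers, only the three flags shown above are used, and `full_page` is the only `--type` value confirmed in this document.

```python
import subprocess
import sys


def build_eval_command(eval_type, data_path, save_name):
    """Build the argument list for evaluate.py using the flags shown above."""
    return [
        sys.executable,  # run with the interpreter of the active environment
        "evaluate.py",
        f"--type={eval_type}",
        f"--data={data_path}",
        f"--save_name={save_name}",
    ]


def run_eval(eval_type, data_path, save_name):
    """Run one evaluation from the repository root; raises if the script fails."""
    subprocess.run(build_eval_command(eval_type, data_path, save_name), check=True)


if __name__ == "__main__":
    # Mirrors the full-page example above; only prints the command.
    print(" ".join(build_eval_command("full_page", "./Callibench", "exp")))
```

Passing the arguments as a list avoids shell quoting problems when the dataset path contains spaces.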
@InProceedings{Luo_2025_ICCV,
author = {Luo, Yuxuan and Tang, Jiaqi and Huang, Chenyi and Hao, Feiyang and Lian, Zhouhui},
title = {CalliReader: Contextualizing Chinese Calligraphy via an Embedding-Aligned Vision-Language Model},
booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
month = {October},
year = {2025},
pages = {23030-23040}
}
