Citrus-V: Advancing Medical Foundation Models with Unified Medical Image Grounding for Clinical Reasoning
Guoxin Wang, Jun Zhao, Xinyi Liu, Yanbo Liu, Xuyang Cao, Chao Li, Zhuoyun Liu, Qintian Sun, Fangru Zhou, Haoqiang Xing, Zhenhong Yang
Medical imaging provides critical evidence for clinical diagnosis, treatment planning, and surgical decisions, yet most existing imaging models are narrowly focused and require multiple specialized networks, limiting their generalization. Although large-scale language and multimodal models exhibit strong reasoning and multi-task capabilities, real-world clinical applications demand precise visual grounding, multimodal integration, and chain-of-thought reasoning. We introduce Citrus-V, a multimodal medical foundation model that combines image analysis with textual reasoning. The model integrates detection, segmentation, and multimodal chain-of-thought reasoning, enabling pixel-level lesion localization, structured report generation, and physician-like diagnostic inference in a single framework. We propose a novel multimodal training approach and release a curated open-source data suite covering reasoning, detection, segmentation, and document understanding tasks. Evaluations demonstrate that Citrus-V outperforms existing open-source medical models and expert-level imaging systems across multiple benchmarks, delivering a unified pipeline from visual grounding to clinical reasoning and supporting precise lesion quantification, automated reporting, and reliable second opinions.
Model architecture of Citrus-V. The framework consists of three components: (1) an MLLM—including the LLM, tokenizer, and a vision encoder—for high-level visual-textual reasoning such as report generation, VQA, and grounding; (2) a segmentation projector that maps the "[SEG]" token produced by the MLLM into latent segmentation prompts; and (3) a segmentation model that decodes the latent segmentation prompts together with semantic image features into pixel-level masks. Separate image encoders are employed to decouple low-level details for segmentation from high-level semantics for other tasks, ensuring both types of tasks are optimized without semantic conflict.
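As a minimal, illustrative sketch only (not the actual Citrus-V implementation), the flow from the "[SEG]" token to a pixel-level mask can be pictured as follows; all module names, dimensions, and the toy dot-product "decoder" are assumptions standing in for the real segmentation projector and SAM2-style mask decoder.

import torch
import torch.nn as nn

class SegProjector(nn.Module):
    # Hypothetical projector: maps the MLLM hidden state at the [SEG] position
    # into a latent segmentation prompt (sizes are illustrative).
    def __init__(self, llm_dim=4096, prompt_dim=256):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(llm_dim, prompt_dim),
            nn.GELU(),
            nn.Linear(prompt_dim, prompt_dim),
        )

    def forward(self, seg_hidden):
        # seg_hidden: (B, llm_dim) hidden state of the [SEG] token
        return self.proj(seg_hidden)  # (B, prompt_dim) latent segmentation prompt

# Toy forward pass: project the [SEG] hidden state, then condition a stand-in
# decoder on low-level features from the separate segmentation image encoder.
seg_hidden = torch.randn(1, 4096)          # placeholder MLLM hidden state
prompt = SegProjector()(seg_hidden)        # latent segmentation prompt
image_feats = torch.randn(1, 256, 64, 64)  # placeholder low-level image features
mask_logits = torch.einsum("bc,bchw->bhw", prompt, image_feats)  # dot-product stand-in
print(mask_logits.shape)                   # torch.Size([1, 64, 64])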
- Release Gradio Demo
- Release 33B Model
- Release 73B Model
- Deploy & Inference
To install Citrus-V:
- Create base environment.
  conda create -n citrus_v python=3.10 -y
  conda activate citrus_v
- Install requirements.
  git clone https://github.com/jdh-algo/Citrus-V.git
  cd Citrus-V
  pip install -r requirements_citrus.txt
- Install flash-attention according to your environment. Here we used flash-attn==2.7.3.
- Install Citrus-V training environment (based on ms-swift).
  pip install -e .
Make sure you have git-lfs installed and download all the following checkpoints to projects/pretrained_weights.
Download Citrus-V checkpoints:
git lfs install
git clone https://huggingface.co/jdh-algo/Citrus-V-8B-v1.0
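If git-lfs is not an option, a possible alternative (assuming the repository is publicly downloadable from the Hugging Face Hub) is the Hub's Python client; the local_dir below simply mirrors the projects/pretrained_weights convention above.

from huggingface_hub import snapshot_download

# Download the Citrus-V-8B checkpoint without git-lfs.
snapshot_download(
    repo_id="jdh-algo/Citrus-V-8B-v1.0",
    local_dir="projects/pretrained_weights/Citrus-V-8B-v1.0",
)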
We recommend following the official ms-swift documentation to prepare your custom training dataset; a minimal format sketch is shown below.
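As a rough sketch (not the authoritative schema), ms-swift's standard multimodal JSONL stores one sample per line with messages and images fields; the file name, prompt, answer, and image path below are placeholders, and any grounding- or segmentation-specific fields should follow the official ms-swift documentation and the released Citrus-V data suite.

import json

# One hypothetical training sample in ms-swift's messages/images JSONL format.
sample = {
    "messages": [
        {"role": "user", "content": "<image>Describe the key findings in this scan."},
        {"role": "assistant", "content": "The scan shows ..."},  # placeholder answer
    ],
    "images": ["/path/to/example_scan.png"],  # placeholder image path
}

with open("custom_train.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(sample, ensure_ascii=False) + "\n")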
Four Training Stages of Citrus-V: (1) concept alignment for a stable vision–language mapping; (2) comprehension enhancement for stronger multimodal reasoning; (3) instruction fine-tuning to strengthen instruction-following ability while encoding segmentation intent; and (4) segmentation fine-tuning to adapt SAM2 for precise medical image segmentation.
It is recommended to start training from stage 3 using the pretrained Citrus-V model.
To train the Citrus-V model from scratch, first build the initial model using the following script:
python architectures/build_citrus_v_model.py
View Complete Training Command (full-parameter training: ViT, aligner, and LLM unfrozen)
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
NPROC_PER_NODE=8 \
MIN_PIXELS=200704 \
MAX_PIXELS=1003520 \
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
swift sft \
--model {pretrained ckpt address} \
--dataset {your dataset address} \
--template citrus_v \
--train_type full \
--torch_dtype bfloat16 \
--attn_impl flash_attn \
--max_length 12288 \
--num_train_epochs 5 \
--learning_rate 1e-5 \
--warmup_ratio 0 \
--warmup_steps 100 \
--adam_beta1 0.9 \
--adam_beta2 0.999 \
--weight_decay 0.1 \
--max_grad_norm 1.0 \
--per_device_train_batch_size 1 \
--gradient_accumulation_steps 8 \
--dataloader_num_workers 64 \
--dataset_num_proc 1 \
--freeze_vit false \
--freeze_aligner false \
--freeze_llm false \
--save_strategy epoch \
--save_total_limit 8 \
--logging_steps 5 \
    --output_dir {your model save path} \
--save_only_model \
--gradient_checkpointing true \
--ddp_find_unused_parameters true
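For reference, the effective global batch size implied by the command above is per_device_train_batch_size × gradient_accumulation_steps × NPROC_PER_NODE:

# Effective global batch size for the settings above (no new assumptions).
per_device_train_batch_size = 1
gradient_accumulation_steps = 8
nproc_per_node = 8  # NPROC_PER_NODE
print(per_device_train_batch_size * gradient_accumulation_steps * nproc_per_node)  # 64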
View Complete Training Command (ViT, aligner, and LLM frozen; additional parameters frozen via freeze_custom_parameters_json)
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
NPROC_PER_NODE=8 \
MIN_PIXELS=200704 \
MAX_PIXELS=1003520 \
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
swift sft \
--model {pretrained ckpt address} \
--dataset {your dataset address} \
--template citrus_v \
--train_type full \
--torch_dtype bfloat16 \
--attn_impl flash_attn \
--max_length 12288 \
--num_train_epochs 5 \
--learning_rate 1e-5 \
--warmup_ratio 0 \
--warmup_steps 100 \
--adam_beta1 0.9 \
--adam_beta2 0.999 \
--weight_decay 0.1 \
--max_grad_norm 1.0 \
--per_device_train_batch_size 1 \
--gradient_accumulation_steps 8 \
--dataloader_num_workers 64 \
--dataset_num_proc 1 \
--freeze_vit true \
--freeze_aligner true \
--freeze_llm true \
--freeze_custom_parameters_json {/path/to/projects/vlm_7B_params.json} \
--save_strategy epoch \
--save_total_limit 8 \
--logging_steps 5 \
    --output_dir {your model save path} \
--save_only_model \
--gradient_checkpointing true \
--ddp_find_unused_parameters true
- Deploy
CUDA_VISIBLE_DEVICES=0,1,2,3 \
MAX_PIXELS=65535 \
VIDEO_MAX_PIXELS=50176 \
FPS_MAX_FRAMES=12 \
swift deploy \
--model /path/to/Citrus-V-8B-v1.0 \
--served_model_name CitrusV_8B \
--template citrus_v_infer \
--infer_backend pt \
--torch_dtype bfloat16 \
--port 8000
- Inference with Deployment
cd projects
python inference_with_deploy.py
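inference_with_deploy.py is the reference client for the server started above. Assuming swift deploy exposes ms-swift's usual OpenAI-compatible endpoint on port 8000, a minimal hand-rolled request could look like the sketch below; the image URL and prompt are placeholders.

from openai import OpenAI

# Query the locally deployed CitrusV_8B server (OpenAI-compatible API assumed).
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
resp = client.chat.completions.create(
    model="CitrusV_8B",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://example.com/chest_xray.png"}},
            {"type": "text", "text": "Describe the key findings in this image."},
        ],
    }],
)
print(resp.choices[0].message.content)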
- Inference with ms-swift PtEngine
python inference.py --model /path/to/Citrus-V-8B-v1.0
This project is licensed under the Apache License, Version 2.0. For models and datasets, please refer to their original resource pages and follow the corresponding licenses.
If you use Citrus-V in your research, please cite our work:
@misc{wang2025citrusvadvancingmedicalfoundation,
title={Citrus-V: Advancing Medical Foundation Models with Unified Medical Image Grounding for Clinical Reasoning},
author={Guoxin Wang and Jun Zhao and Xinyi Liu and Yanbo Liu and Xuyang Cao and Chao Li and Zhuoyun Liu and Qintian Sun and Fangru Zhou and Haoqiang Xing and Zhenhong Yang},
year={2025},
eprint={2509.19090},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2509.19090},
}
We would like to thank the contributors to the ms-swift, SA2VA, SAM2, Qwen2.5-VL, and mmdetection repositories for their open research and extraordinary work.