Describe the bug
What the bug is, and how to reproduce it (screenshots are helpful)
When running SFT on KIMI VL with a multi-image dataset, training fails. The error message is as follows:
[rank6]: Traceback (most recent call last):
[rank6]: File "/path/to/project/swift/cli/sft.py", line 7, in
[rank6]: sft_main()
[rank6]: File "/path/to/project/swift/llm/train/sft.py", line 284, in sft_main
[rank6]: return SwiftSft(args).main()
[rank6]: File "/path/to/project/swift/llm/base.py", line 47, in main
[rank6]: result = self.run()
[rank6]: File "/path/to/project/swift/llm/train/sft.py", line 150, in run
[rank6]: return self.train(trainer)
[rank6]: File "/path/to/project/swift/llm/train/sft.py", line 210, in train
[rank6]: trainer.train(trainer.args.resume_from_checkpoint)
[rank6]: File "/path/to/project/swift/trainers/mixin.py", line 323, in train
[rank6]: res = super().train(*args, **kwargs)
[rank6]: File ".../site-packages/transformers/trainer.py", line 2245, in train
[rank6]: return inner_training_loop(
[rank6]: File ".../site-packages/transformers/trainer.py", line 2560, in _inner_training_loop
[rank6]: tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
[rank6]: File ".../site-packages/transformers/trainer.py", line 3736, in training_step
[rank6]: loss = self.compute_loss(model, inputs, num_items_in_batch=num_items_in_batch)
[rank6]: File "/path/to/project/swift/trainers/trainers.py", line 165, in compute_loss
[rank6]: outputs = model(**inputs)
[rank6]: File ".../site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank6]: return self._call_impl(*args, **kwargs)
[rank6]: File ".../site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank6]: return forward_call(*args, **kwargs)
[rank6]: File ".../site-packages/deepspeed/utils/nvtx.py", line 20, in wrapped_fn
[rank6]: ret_val = func(*args, **kwargs)
[rank6]: File ".../site-packages/deepspeed/runtime/engine.py", line 2054, in forward
[rank6]: loss = self.module(*inputs, **kwargs)
[rank6]: File ".../site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank6]: return self._call_impl(*args, **kwargs)
[rank6]: File ".../site-packages/torch/nn/modules/module.py", line 1844, in _call_impl
[rank6]: return inner()
[rank6]: File ".../site-packages/torch/nn/modules/module.py", line 1779, in inner
[rank6]: args_result = hook(self, args)
[rank6]: File ".../site-packages/deepspeed/utils/nvtx.py", line 20, in wrapped_fn
[rank6]: ret_val = func(*args, **kwargs)
[rank6]: File ".../site-packages/deepspeed/runtime/zero/parameter_offload.py", line 250, in _start_of_forward_hook
[rank6]: self.get_param_coordinator().reset_step()
[rank6]: File ".../site-packages/torch/_dynamo/eval_frame.py", line 632, in _fn
[rank6]: return fn(*args, **kwargs)
[rank6]: File ".../site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 237, in reset_step
[rank6]: assert_ints_same_as_other_ranks([m.ds_id for m in self.__submodule_order])
[rank6]: File ".../site-packages/deepspeed/utils/nvtx.py", line 20, in wrapped_fn
[rank6]: ret_val = func(*args, **kwargs)
[rank6]: File ".../site-packages/deepspeed/runtime/zero/utils.py", line 90, in assert_ints_same_as_other_ranks
[rank6]: raise RuntimeError(f"disagreement between rank0 and rank{dist.get_rank()}: "
[rank6]: RuntimeError: disagreement between rank0 and rank6: rank0: [0, 259, 1, 2, 3, 4, 5, ... 248], rank6: [0, 259, 1, 2, 3, 4, 5, ... 248]
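The failing assertion (assert_ints_same_as_other_ranks) is DeepSpeed ZeRO-3's check that every rank traced the same ordered list of partitioned-parameter ids during the forward pass. The sketch below is only a simplified illustration of that kind of cross-rank consistency check, not DeepSpeed's actual code; the helper name, the toy id trace, and the assumption that ranks execute the vision encoder a different number of times (because their samples contain different numbers of images) are all hypothetical.

```python
# Simplified illustration (NOT DeepSpeed's implementation) of the cross-rank
# consistency check that raises the error above: each rank gathers the ordered
# list of submodule/parameter ids it traced during forward, and every rank must
# match rank 0. If multi-image samples make ranks run the vision encoder a
# different number of times, the traced orders diverge and the check fails.
import torch
import torch.distributed as dist


def check_trace_matches_rank0(traced_ids):
    """Hypothetical helper: compare this rank's traced ids against rank 0's."""
    gathered = [None] * dist.get_world_size()
    dist.all_gather_object(gathered, traced_ids)
    rank = dist.get_rank()
    if gathered[rank] != gathered[0]:
        raise RuntimeError(f"disagreement between rank0 and rank{rank}: "
                           f"rank0: {gathered[0]}, rank{rank}: {gathered[rank]}")


if __name__ == "__main__":
    # Launch with torchrun so the default env:// init works, e.g.
    #   torchrun --nproc_per_node=2 trace_check_demo.py
    dist.init_process_group("nccl" if torch.cuda.is_available() else "gloo")
    rank = dist.get_rank()
    # Toy trace: pretend the vision encoder (ids 1..5) runs once per image and
    # this rank's batch happens to contain a different number of images.
    num_images = 1 if rank == 0 else 2
    traced = [0, 259] + list(range(1, 6)) * num_images
    check_trace_matches_rank0(traced)
    dist.destroy_process_group()
```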
Your hardware and system info
Write your system info like CUDA version / system / GPU / torch version here
A100, Linux
Partial environment:
accelerate 1.3.0
deepspeed 0.15.4
peft 0.14.0
torch 2.5.1
torchaudio 2.5.1
torchvision 0.20.1
transformers 4.51.3
transformers-stream-generator 0.0.5
trl 0.16.0.dev0
Additional context
Add any other context about the problem here
Similar issue: deepspeedai/DeepSpeed#5799
Already tried:
--lazy_tokenize false
--packing false
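For reference, below is a hypothetical example of a single multi-image training record. The format is assumed from ms-swift's custom-dataset documentation; the file name, image paths, and text are placeholders, not taken from this issue.

```python
# Hypothetical multi-image SFT record (assumed ms-swift "messages" format).
# The number of images can vary between samples, which is what distinguishes
# this multi-image dataset from the single-image case that trains normally.
import json

sample = {
    "messages": [
        {"role": "user",
         "content": "<image><image>Describe the difference between the two images."},
        {"role": "assistant",
         "content": "The first image shows ..., while the second shows ..."},
    ],
    "images": ["images/0001_a.jpg", "images/0001_b.jpg"],
}

# Append the record as one JSONL line to a placeholder dataset file.
with open("multi_image_sft.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(sample, ensure_ascii=False) + "\n")
```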