KIMI VL SFT ERROR #5218

@MooMoo-Yang

Description

Describe the bug
What the bug is and how to reproduce it, preferably with screenshots

KIMI VL SFT fails with an error when a multi-image dataset is used. The error output is as follows:
[rank6]: Traceback (most recent call last):
[rank6]: File "/path/to/project/swift/cli/sft.py", line 7, in <module>
[rank6]: sft_main()
[rank6]: File "/path/to/project/swift/llm/train/sft.py", line 284, in sft_main
[rank6]: return SwiftSft(args).main()
[rank6]: File "/path/to/project/swift/llm/base.py", line 47, in main
[rank6]: result = self.run()
[rank6]: File "/path/to/project/swift/llm/train/sft.py", line 150, in run
[rank6]: return self.train(trainer)
[rank6]: File "/path/to/project/swift/llm/train/sft.py", line 210, in train
[rank6]: trainer.train(trainer.args.resume_from_checkpoint)
[rank6]: File "/path/to/project/swift/trainers/mixin.py", line 323, in train
[rank6]: res = super().train(*args, **kwargs)
[rank6]: File ".../site-packages/transformers/trainer.py", line 2245, in train
[rank6]: return inner_training_loop(
[rank6]: File ".../site-packages/transformers/trainer.py", line 2560, in _inner_training_loop
[rank6]: tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
[rank6]: File ".../site-packages/transformers/trainer.py", line 3736, in training_step
[rank6]: loss = self.compute_loss(model, inputs, num_items_in_batch=num_items_in_batch)
[rank6]: File "/path/to/project/swift/trainers/trainers.py", line 165, in compute_loss
[rank6]: outputs = model(**inputs)
[rank6]: File ".../site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank6]: return self._call_impl(*args, **kwargs)
[rank6]: File ".../site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank6]: return forward_call(*args, **kwargs)
[rank6]: File ".../site-packages/deepspeed/utils/nvtx.py", line 20, in wrapped_fn
[rank6]: ret_val = func(*args, **kwargs)
[rank6]: File ".../site-packages/deepspeed/runtime/engine.py", line 2054, in forward
[rank6]: loss = self.module(*inputs, **kwargs)
[rank6]: File ".../site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank6]: return self._call_impl(*args, **kwargs)
[rank6]: File ".../site-packages/torch/nn/modules/module.py", line 1844, in _call_impl
[rank6]: return inner()
[rank6]: File ".../site-packages/torch/nn/modules/module.py", line 1779, in inner
[rank6]: args_result = hook(self, args)
[rank6]: File ".../site-packages/deepspeed/utils/nvtx.py", line 20, in wrapped_fn
[rank6]: ret_val = func(*args, **kwargs)
[rank6]: File ".../site-packages/deepspeed/runtime/zero/parameter_offload.py", line 250, in _start_of_forward_hook
[rank6]: self.get_param_coordinator().reset_step()
[rank6]: File ".../site-packages/torch/_dynamo/eval_frame.py", line 632, in _fn
[rank6]: return fn(*args, **kwargs)
[rank6]: File ".../site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 237, in reset_step
[rank6]: assert_ints_same_as_other_ranks([m.ds_id for m in self.__submodule_order])
[rank6]: File ".../site-packages/deepspeed/utils/nvtx.py", line 20, in wrapped_fn
[rank6]: ret_val = func(*args, **kwargs)
[rank6]: File ".../site-packages/deepspeed/runtime/zero/utils.py", line 90, in assert_ints_same_as_other_ranks
[rank6]: raise RuntimeError(f"disagreement between rank0 and rank{dist.get_rank()}: "
[rank6]: RuntimeError: disagreement between rank0 and rank6: rank0: [0, 259, 1, 2, 3, 4, 5, ... 248], rank6: [0, 259, 1, 2, 3, 4, 5, ... 248]
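
The error message elides the middle of each list (`...`), so the two visible traces look identical even though the check failed. Below is a minimal sketch for locating the first position where two full traces differ. It assumes the complete `ds_id` lists have already been captured on each rank by some means not shown here; the helper name `first_divergence` and the sample values are hypothetical and are not part of DeepSpeed or swift.

```python
from itertools import zip_longest


def first_divergence(a, b):
    """Return (index, value_a, value_b) for the first position where the
    two traces differ, or None if they are identical. When one trace is
    shorter, its missing entry is reported as None."""
    for i, (x, y) in enumerate(zip_longest(a, b, fillvalue=None)):
        if x != y:
            return i, x, y
    return None


# Hypothetical traces for illustration only, not taken from the log above.
rank0 = [0, 259, 1, 2, 3, 4, 5, 248]
rank6 = [0, 259, 1, 2, 4, 5, 248]

print(first_divergence(rank0, rank6))  # (4, 3, 4)
```

A result such as `(4, 3, 4)` would mean both ranks agree on the first four entries and then diverge, which points at the first submodule that one rank executed and the other did not.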

Your hardware and system info
Write your system info here, such as CUDA version, OS, GPU model, and torch version

A100, Linux
Partial environment:
accelerate 1.3.0
deepspeed 0.15.4
peft 0.14.0
torch 2.5.1
torchaudio 2.5.1
torchvision 0.20.1
transformers 4.51.3
transformers-stream-generator 0.0.5
trl 0.16.0.dev0

Additional context
Add any other context about the problem here
Similar issue: deepspeedai/DeepSpeed#5799

Already tried:
--lazy_tokenize false
--packing false
