Describe the bug
What the bug is, and how to reproduce it (screenshots are helpful)
When running SFT on KIMI VL with a multi-image dataset, training fails. The error message is as follows:
[rank6]: Traceback (most recent call last):
[rank6]: File "/path/to/project/swift/cli/sft.py", line 7, in
[rank6]: sft_main()
[rank6]: File "/path/to/project/swift/llm/train/sft.py", line 284, in sft_main
[rank6]: return SwiftSft(args).main()
[rank6]: File "/path/to/project/swift/llm/base.py", line 47, in main
[rank6]: result = self.run()
[rank6]: File "/path/to/project/swift/llm/train/sft.py", line 150, in run
[rank6]: return self.train(trainer)
[rank6]: File "/path/to/project/swift/llm/train/sft.py", line 210, in train
[rank6]: trainer.train(trainer.args.resume_from_checkpoint)
[rank6]: File "/path/to/project/swift/trainers/mixin.py", line 323, in train
[rank6]: res = super().train(*args, **kwargs)
[rank6]: File ".../site-packages/transformers/trainer.py", line 2245, in train
[rank6]: return inner_training_loop(
[rank6]: File ".../site-packages/transformers/trainer.py", line 2560, in _inner_training_loop
[rank6]: tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
[rank6]: File ".../site-packages/transformers/trainer.py", line 3736, in training_step
[rank6]: loss = self.compute_loss(model, inputs, num_items_in_batch=num_items_in_batch)
[rank6]: File "/path/to/project/swift/trainers/trainers.py", line 165, in compute_loss
[rank6]: outputs = model(**inputs)
[rank6]: File ".../site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank6]: return self._call_impl(*args, **kwargs)
[rank6]: File ".../site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank6]: return forward_call(*args, **kwargs)
[rank6]: File ".../site-packages/deepspeed/utils/nvtx.py", line 20, in wrapped_fn
[rank6]: ret_val = func(*args, **kwargs)
[rank6]: File ".../site-packages/deepspeed/runtime/engine.py", line 2054, in forward
[rank6]: loss = self.module(*inputs, **kwargs)
[rank6]: File ".../site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank6]: return self._call_impl(*args, **kwargs)
[rank6]: File ".../site-packages/torch/nn/modules/module.py", line 1844, in _call_impl
[rank6]: return inner()
[rank6]: File ".../site-packages/torch/nn/modules/module.py", line 1779, in inner
[rank6]: args_result = hook(self, args)
[rank6]: File ".../site-packages/deepspeed/utils/nvtx.py", line 20, in wrapped_fn
[rank6]: ret_val = func(*args, **kwargs)
[rank6]: File ".../site-packages/deepspeed/runtime/zero/parameter_offload.py", line 250, in _start_of_forward_hook
[rank6]: self.get_param_coordinator().reset_step()
[rank6]: File ".../site-packages/torch/_dynamo/eval_frame.py", line 632, in _fn
[rank6]: return fn(*args, **kwargs)
[rank6]: File ".../site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 237, in reset_step
[rank6]: assert_ints_same_as_other_ranks([m.ds_id for m in self.__submodule_order])
[rank6]: File ".../site-packages/deepspeed/utils/nvtx.py", line 20, in wrapped_fn
[rank6]: ret_val = func(*args, **kwargs)
[rank6]: File ".../site-packages/deepspeed/runtime/zero/utils.py", line 90, in assert_ints_same_as_other_ranks
[rank6]: raise RuntimeError(f"disagreement between rank0 and rank{dist.get_rank()}: "
[rank6]: RuntimeError: disagreement between rank0 and rank6: rank0: [0, 259, 1, 2, 3, 4, 5, ... 248], rank6: [0, 259, 1, 2, 3, 4, 5, ... 248]
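The failing assertion (assert_ints_same_as_other_ranks) is DeepSpeed ZeRO-3's check that every rank traced the same ordered list of partitioned-parameter ids during the forward pass. The sketch below is only a simplified illustration of that kind of cross-rank consistency check, not DeepSpeed's actual code; the helper name, the toy id trace, and the assumption that ranks execute the vision encoder a different number of times (because their samples contain different numbers of images) are all hypothetical.

```python
# Simplified illustration (NOT DeepSpeed's implementation) of the cross-rank
# consistency check that raises the error above: each rank gathers the ordered
# list of submodule/parameter ids it traced during forward, and every rank must
# match rank 0. If multi-image samples make ranks run the vision encoder a
# different number of times, the traced orders diverge and the check fails.
import torch
import torch.distributed as dist


def check_trace_matches_rank0(traced_ids):
    """Hypothetical helper: compare this rank's traced ids against rank 0's."""
    gathered = [None] * dist.get_world_size()
    dist.all_gather_object(gathered, traced_ids)
    rank = dist.get_rank()
    if gathered[rank] != gathered[0]:
        raise RuntimeError(f"disagreement between rank0 and rank{rank}: "
                           f"rank0: {gathered[0]}, rank{rank}: {gathered[rank]}")


if __name__ == "__main__":
    # Launch with torchrun so the default env:// init works, e.g.
    #   torchrun --nproc_per_node=2 trace_check_demo.py
    dist.init_process_group("nccl" if torch.cuda.is_available() else "gloo")
    rank = dist.get_rank()
    # Toy trace: pretend the vision encoder (ids 1..5) runs once per image and
    # this rank's batch happens to contain a different number of images.
    num_images = 1 if rank == 0 else 2
    traced = [0, 259] + list(range(1, 6)) * num_images
    check_trace_matches_rank0(traced)
    dist.destroy_process_group()
```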
Your hardware and system info
Write your system info like CUDA version / system / GPU / torch version here
A100, Linux
Partial environment:
accelerate 1.3.0
deepspeed 0.15.4
peft 0.14.0
torch 2.5.1
torchaudio 2.5.1
torchvision 0.20.1
transformers 4.51.3
transformers-stream-generator 0.0.5
trl 0.16.0.dev0
Additional context
Add any other context about the problem here
Similar issue: deepspeedai/DeepSpeed#5799
Already tried:
--lazy_tokenize false
--packing false
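For reference, below is a hypothetical example of a single multi-image training record. The format is assumed from ms-swift's custom-dataset documentation; the file name, image paths, and text are placeholders, not taken from this issue.

```python
# Hypothetical multi-image SFT record (assumed ms-swift "messages" format).
# The number of images can vary between samples, which is what distinguishes
# this multi-image dataset from the single-image case that trains normally.
import json

sample = {
    "messages": [
        {"role": "user",
         "content": "<image><image>Describe the difference between the two images."},
        {"role": "assistant",
         "content": "The first image shows ..., while the second shows ..."},
    ],
    "images": ["images/0001_a.jpg", "images/0001_b.jpg"],
}

# Append the record as one JSONL line to a placeholder dataset file.
with open("multi_image_sft.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(sample, ensure_ascii=False) + "\n")
```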