[runtime] Support Cosyvoice2 Nvidia TensorRT-LLM Inference Solution #1489


Open
yuekaizhang wants to merge 6 commits into main
Conversation

yuekaizhang

This PR supports deploying the CosyVoice2 model using Nvidia TensorRT-LLM and Triton.
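
For context, the model is served behind a Triton endpoint, and with Decoupled=True (the streaming mode benchmarked below) the generated audio comes back in chunks over Triton's gRPC streaming API. The sketch below shows one way to drive such an endpoint with the `tritonclient` Python package; the model name (`cosyvoice2`) and the tensor names (`PROMPT_AUDIO`, `TARGET_TEXT`, `WAVEFORM`) are illustrative assumptions, not the names actually defined by this PR's model repository config.

```python
import queue
import numpy as np
import tritonclient.grpc as grpcclient

# Responses from a decoupled model arrive asynchronously on a stream.
results = queue.Queue()

def callback(result, error):
    results.put(error if error is not None else result)

client = grpcclient.InferenceServerClient(url="localhost:8001")
client.start_stream(callback=callback)

# Hypothetical request: a prompt waveform plus the target text to synthesize.
prompt_audio = np.zeros((1, 16000), dtype=np.float32)              # placeholder prompt audio
target_text = np.array([[b"Hello, this is a test sentence."]], dtype=object)

inputs = [
    grpcclient.InferInput("PROMPT_AUDIO", list(prompt_audio.shape), "FP32"),
    grpcclient.InferInput("TARGET_TEXT", list(target_text.shape), "BYTES"),
]
inputs[0].set_data_from_numpy(prompt_audio)
inputs[1].set_data_from_numpy(target_text)

client.async_stream_infer(
    model_name="cosyvoice2",
    inputs=inputs,
    outputs=[grpcclient.InferRequestedOutput("WAVEFORM")],
    request_id="0",
)

# Drain streamed audio chunks; a production client would key off the server's
# final-response flag rather than a simple timeout.
chunks = []
try:
    while True:
        resp = results.get(timeout=10.0)
        if isinstance(resp, Exception):
            raise resp
        chunks.append(resp.as_numpy("WAVEFORM"))
except queue.Empty:
    pass

client.stop_stream()
waveform = np.concatenate(chunks, axis=-1) if chunks else np.empty(0)
print("generated samples:", waveform.shape)
```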

Decoding on a single L20 GPU with 26 prompt_audio/target_text pairs (≈221 s of audio); RTF is explained in the note after the table:

| Mode | Note | Concurrency | Avg Latency (ms) | P50 Latency (ms) | RTF |
|----------------|--------|-------------|------------------|------------------|--------|
| Decoupled=True | Commit | 1 | 659.87 | 655.63 | 0.0891 |
| Decoupled=True | Commit | 2 | 1103.16 | 992.96 | 0.0693 |
| Decoupled=True | Commit | 4 | 1790.91 | 1668.63 | 0.0604 |
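
RTF (real-time factor) here is wall-clock processing time divided by the duration of the audio produced, so an RTF of 0.0604 corresponds to roughly 221 s × 0.0604 ≈ 13 s of wall-clock time for the whole test set. The helper below is a hypothetical sketch of how such summary numbers can be computed from per-request measurements; it is not part of this PR.

```python
import statistics

def summarize(latencies_ms, total_audio_s, wall_clock_s):
    """Summarize one benchmark run.

    latencies_ms  -- per-request end-to-end latency in milliseconds
    total_audio_s -- total duration of the generated audio in seconds
    wall_clock_s  -- wall-clock time for the whole run (all requests)
    """
    avg_ms = statistics.mean(latencies_ms)
    p50_ms = statistics.median(latencies_ms)
    rtf = wall_clock_s / total_audio_s
    return avg_ms, p50_ms, rtf

# Toy example with made-up numbers (not the benchmark above):
latencies = [650.0, 660.0, 655.0, 670.0]
avg, p50, rtf = summarize(latencies, total_audio_s=34.0, wall_clock_s=2.7)
print(f"Avg {avg:.2f} ms, P50 {p50:.2f} ms, RTF {rtf:.4f}")
```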
