[runtime] Support Cosyvoice2 Nvidia TensorRT-LLM Inference Solution #1489


Open
yuekaizhang wants to merge 6 commits into main
Conversation

yuekaizhang

This PR supports deploying the CosyVoice2 model using Nvidia TensorRT-LLM and Triton.
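
For context, the model is served behind a Triton endpoint, and with Decoupled=True (the streaming mode benchmarked below) the generated audio comes back in chunks over Triton's gRPC streaming API. The sketch below shows one way to drive such an endpoint with the `tritonclient` Python package; the model name (`cosyvoice2`) and the tensor names (`PROMPT_AUDIO`, `TARGET_TEXT`, `WAVEFORM`) are illustrative assumptions, not the names actually defined by this PR's model repository config.

```python
import queue
import numpy as np
import tritonclient.grpc as grpcclient

# Responses from a decoupled model arrive asynchronously on a stream.
results = queue.Queue()

def callback(result, error):
    results.put(error if error is not None else result)

client = grpcclient.InferenceServerClient(url="localhost:8001")
client.start_stream(callback=callback)

# Hypothetical request: a prompt waveform plus the target text to synthesize.
prompt_audio = np.zeros((1, 16000), dtype=np.float32)              # placeholder prompt audio
target_text = np.array([[b"Hello, this is a test sentence."]], dtype=object)

inputs = [
    grpcclient.InferInput("PROMPT_AUDIO", list(prompt_audio.shape), "FP32"),
    grpcclient.InferInput("TARGET_TEXT", list(target_text.shape), "BYTES"),
]
inputs[0].set_data_from_numpy(prompt_audio)
inputs[1].set_data_from_numpy(target_text)

client.async_stream_infer(
    model_name="cosyvoice2",
    inputs=inputs,
    outputs=[grpcclient.InferRequestedOutput("WAVEFORM")],
    request_id="0",
)

# Drain streamed audio chunks; a production client would key off the server's
# final-response flag rather than a simple timeout.
chunks = []
try:
    while True:
        resp = results.get(timeout=10.0)
        if isinstance(resp, Exception):
            raise resp
        chunks.append(resp.as_numpy("WAVEFORM"))
except queue.Empty:
    pass

client.stop_stream()
waveform = np.concatenate(chunks, axis=-1) if chunks else np.empty(0)
print("generated samples:", waveform.shape)
```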

Decoding on a single L20 GPU with 26 prompt_audio/target_text pairs (≈221 s of audio); RTF is explained in the note after the table:

| Mode | Note | Concurrency | Avg Latency (ms) | P50 Latency (ms) | RTF |
|----------------|--------|-------------|------------------|------------------|--------|
| Decoupled=True | Commit | 1 | 659.87 | 655.63 | 0.0891 |
| Decoupled=True | Commit | 2 | 1103.16 | 992.96 | 0.0693 |
| Decoupled=True | Commit | 4 | 1790.91 | 1668.63 | 0.0604 |
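
RTF (real-time factor) here is wall-clock processing time divided by the duration of the audio produced, so an RTF of 0.0604 corresponds to roughly 221 s × 0.0604 ≈ 13 s of wall-clock time for the whole test set. The helper below is a hypothetical sketch of how such summary numbers can be computed from per-request measurements; it is not part of this PR.

```python
import statistics

def summarize(latencies_ms, total_audio_s, wall_clock_s):
    """Summarize one benchmark run.

    latencies_ms  -- per-request end-to-end latency in milliseconds
    total_audio_s -- total duration of the generated audio in seconds
    wall_clock_s  -- wall-clock time for the whole run (all requests)
    """
    avg_ms = statistics.mean(latencies_ms)
    p50_ms = statistics.median(latencies_ms)
    rtf = wall_clock_s / total_audio_s
    return avg_ms, p50_ms, rtf

# Toy example with made-up numbers (not the benchmark above):
latencies = [650.0, 660.0, 655.0, 670.0]
avg, p50, rtf = summarize(latencies, total_audio_s=34.0, wall_clock_s=2.7)
print(f"Avg {avg:.2f} ms, P50 {p50:.2f} ms, RTF {rtf:.4f}")
```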
