Running gpt-oss-120b model with llama.cpp on H100 GPUs? #16198
Unanswered · bartlettroscoe asked this question in Q&A
Replies: 1 comment
- The numerical issues described in the linked issue were specifically due to the use of |
Has anyone had success running the gpt-oss-120b model on NVIDIA H100 GPUs? I can't find any evidence of anyone using llama.cpp to run the gpt-oss-120b model on an H100 GPU, even though there is lots of talk about gpt-oss-120b running on an H100, like:
However, that post mentions vLLM, and vLLM does not support tool calling with the gpt-oss models, so you can't use vLLM to serve the gpt-oss models and use them with an agentic coding agent like Codex CLI (OpenAI's own coding agent). See:
So that leaves us with llama.cpp to try to run the gpt-oss models on H100s (and we actually have a bunch of H100s that we can use). However, when I build and run llama.cpp to serve the gpt-oss-20b and gpt-oss-120b models on our H100s (using `llama-server`), we get gibberish from the model output, like what is reported at:
This seems like it might be some type of numerical problem on this machine or with the CUDA version we are using?
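For context, here is roughly what we are doing. This is just a sketch; the model path, context size, and port below are placeholders rather than our exact configuration:

```bash
# Build llama.cpp with CUDA support (assumes the CUDA toolkit is installed).
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j

# Serve the gpt-oss-120b GGUF; -ngl 99 offloads all layers to the GPU.
# The model path is a placeholder for wherever the GGUF file lives.
./build/bin/llama-server \
  -m /path/to/gpt-oss-120b.gguf \
  -ngl 99 \
  -c 8192 \
  --port 8080

# Quick sanity check against llama-server's OpenAI-compatible endpoint;
# this is where we see the gibberish output.
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Say hello in one sentence."}]}'
```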
NOTE: There is evidence that people have used llama.cpp to run other models on H100s, like:
Has anyone had any luck getting these gpt-oss models to run on H100s with llama.cpp?
FYI: See the identical Reddit post: