Open
Description
System Info
Related NVIDIA forum thread ("GPU has fallen off the bus"): https://forums.developer.nvidia.com/t/gpu-has-fallen-off-the-bus/217357
Currently `fn health(&self) -> Result<(), BackendError>` always returns `Ok(())`, even if the GPU has failed.
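I'm willing to contribute a simple candle equivalent of this PyTorch check:

```python
import torch

def health() -> str:
    try:
        # Copy the result back to the host so the CUDA kernel actually runs and any device error surfaces here
        (torch.tensor([2.0]).cuda() ** 2).cpu()
        return "healthy"
    except Exception:
        return "error"
```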
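A minimal sketch of how the candle side could be structured, assuming the probe is passed in as a closure so the error handling can be shown without a GPU. In a real candle backend the probe would be roughly `Tensor::new(&[2f32], &device)?.sqr()?.to_vec1::<f32>()` on the CUDA device, with the final copy to the host forcing the work to complete. `health_check` and the `BackendError::Unhealthy` variant are hypothetical names, not the project's actual API.

```rust
use std::panic::{self, AssertUnwindSafe};

/// Hypothetical stand-in for the backend's error type.
#[derive(Debug, PartialEq)]
pub enum BackendError {
    Unhealthy(String),
}

/// Runs a tiny probe computation (hypothetical helper) and maps any failure
/// to an error. In a candle backend the probe would square a one-element
/// tensor on the GPU and copy the result back to the host.
pub fn health_check<F>(probe: F) -> Result<(), BackendError>
where
    F: FnOnce() -> Result<(), String>,
{
    // Also catch panics, in case the device layer panics instead of returning an error.
    match panic::catch_unwind(AssertUnwindSafe(probe)) {
        Ok(Ok(())) => Ok(()),
        Ok(Err(e)) => Err(BackendError::Unhealthy(e)),
        Err(_) => Err(BackendError::Unhealthy("health probe panicked".to_string())),
    }
}
```

For example, `health_check(|| Err("CUDA_ERROR_UNKNOWN".to_string()))` returns an `Unhealthy` error, while a probe that succeeds yields `Ok(())`.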
Another option: return an error if the last 3 consecutive candle requests fail.
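The second option could be sketched as a shared counter that each request updates, assuming the backend can record every request's outcome. `FailureTracker` and its methods are hypothetical names for illustration, not existing code in the project.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical tracker: reports unhealthy after `threshold` consecutive
/// failed requests; any successful request resets the streak.
pub struct FailureTracker {
    consecutive_failures: AtomicUsize,
    threshold: usize,
}

impl FailureTracker {
    pub fn new(threshold: usize) -> Self {
        Self {
            consecutive_failures: AtomicUsize::new(0),
            threshold,
        }
    }

    /// Call after every inference request with its outcome.
    pub fn record(&self, success: bool) {
        if success {
            self.consecutive_failures.store(0, Ordering::Relaxed);
        } else {
            self.consecutive_failures.fetch_add(1, Ordering::Relaxed);
        }
    }

    /// Healthy unless the last `threshold` requests all failed.
    pub fn is_healthy(&self) -> bool {
        self.consecutive_failures.load(Ordering::Relaxed) < self.threshold
    }
}
```

With `FailureTracker::new(3)`, `health` would start returning an error only after three failures in a row. One trade-off compared with the probe approach: an idle server never records failures, so a GPU that drops off the bus between requests is not detected until traffic arrives.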
Information
- Docker
- The CLI directly
Tasks
- An officially supported command
- My own modifications
Reproduction
E.g., overheat your GPU until it disconnects (falls off the bus).
Expected behavior