Description
Problem
Currently, if I spin up a fleet in a volatile environment, the fleet can lose instances entirely outside of dstack's control and become unusable. The most common example is a fleet of spot instances, which the service provider can cancel at any moment, but the same applies to any scenario where failure is expected.
Keeping the fleet at its expected size is especially useful when I want to limit the number of instances in use while submitting an unbounded number of runs. As @peterschmidt85 pointed out on Discord, it would also be useful in some scenarios during the startup of a fleet.
Solution
Add something similar (or identical) to the `run::retry_on` system to keep fleets at their expected instance count.
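As a rough illustration of the requested behavior, here is a minimal sketch of a reconciliation step that tops a fleet back up to its target size. Everything in it is hypothetical: `reconcile`, `provision`, and `max_attempts` are illustrative names, not part of dstack, and a real implementation would presumably reuse whatever retry settings the run system already exposes.

```python
from typing import Callable, List, Optional


def reconcile(
    target_count: int,
    live_instances: List[str],
    provision: Callable[[], Optional[str]],  # hypothetical: returns an id, or None on failure
    max_attempts: int = 3,
) -> List[str]:
    """Try to bring the fleet back up to target_count.

    Makes at most max_attempts provisioning calls in total, then gives up
    and returns whatever instances are alive at that point.
    """
    instances = list(live_instances)
    attempts = 0
    while len(instances) < target_count and attempts < max_attempts:
        attempts += 1
        new_instance = provision()
        if new_instance is not None:
            instances.append(new_instance)
    return instances
```

Calling this periodically, or whenever an instance is reported lost, would keep the fleet at its expected count without waiting for a new run to be submitted.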
Workaround
I think one workaround (although I have not tested it yet) might be to create the fleet, set its maximum number of instances, and then explicitly assign runs to it with `creation-policy: reuse-or-create`.
If my reading of the documentation is correct, each run would attempt to create a new instance for itself, but would only succeed if that didn't exceed the maximum number of instances in the fleet. While this would not keep the number of running machines at the fleet's instance count, it would be effectively the same: any missing instance would be spun up as soon as a job attempted to `reuse-or-create` on the fleet while it had fewer instances than the maximum.
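The behavior I am expecting from this workaround can be modeled with a small toy simulation. This is only a sketch of my reading of the documentation, not dstack code: `ToyFleet` and its methods are hypothetical names invented for the illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ToyFleet:
    """Hypothetical stand-in for a fleet with a capped instance count."""

    max_instances: int
    idle: List[int] = field(default_factory=list)
    busy: List[int] = field(default_factory=list)
    _next_id: int = 0

    def reuse_or_create(self) -> Optional[int]:
        """Reuse an idle instance, else create one if under the cap, else return None."""
        if self.idle:
            instance = self.idle.pop()
        elif len(self.idle) + len(self.busy) < self.max_instances:
            instance = self._next_id
            self._next_id += 1
        else:
            return None  # fleet is full; the job has to wait
        self.busy.append(instance)
        return instance

    def lose(self, instance: int) -> None:
        """Simulate the provider reclaiming an instance (e.g. a spot interruption)."""
        if instance in self.busy:
            self.busy.remove(instance)
        elif instance in self.idle:
            self.idle.remove(instance)


fleet = ToyFleet(max_instances=2)
first = fleet.reuse_or_create()
second = fleet.reuse_or_create()
blocked = fleet.reuse_or_create()      # None: already at the cap
fleet.lose(first)                      # instance disappears outside our control
replacement = fleet.reuse_or_create()  # the next job triggers a fresh instance
```

In this model the fleet is only refilled when a job asks for an instance, which is the gap the proposed feature would close.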
Would you like to help us implement this feature by sending a PR?
No