[Feature]: Fleet Retry #2921

### Problem

Currently, if I spin up a fleet in a volatile environment, the fleet can lose instances for reasons entirely outside the control of dstack and become unusable. The most common example is a fleet of spot instances, which the service provider can reclaim at any moment, but the same applies to any scenario where failure is expected.

Fleet retry would be especially useful when I want to limit the number of instances in use while submitting an unbounded number of runs. As @peterschmidt85 pointed out on Discord, it would also be useful in some scenarios during fleet startup.

### Solution

Add something similar (or identical) to the `run::retry_on` system for fleets, so that a fleet is kept at its expected instance count when instances are lost.
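
To make the request more concrete, here is a rough Python sketch of the behavior I have in mind. It is not dstack code: `Fleet`, `reconcile`, and `provision` are hypothetical names I made up for illustration, and a real implementation would presumably also handle retry limits and provisioning failures.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import Callable, List


@dataclass
class Fleet:
    """Hypothetical stand-in for a fleet; not a dstack class."""
    target: int                                   # expected instance count
    instances: List[str] = field(default_factory=list)


def reconcile(fleet: Fleet, provision: Callable[[], str]) -> int:
    """Provision replacements until the fleet is back at its target count.

    Returns the number of instances that were created.
    """
    missing = max(0, fleet.target - len(fleet.instances))
    for _ in range(missing):
        fleet.instances.append(provision())
    return missing


# Hypothetical provisioner that just hands out sequential names.
_ids = count(1)
def provision() -> str:
    return f"replacement-{next(_ids)}"


fleet = Fleet(target=3, instances=["a", "b", "c"])
fleet.instances.remove("b")            # the provider reclaims one spot instance
created = reconcile(fleet, provision)  # fleet retry brings the count back to 3
```

The point is only that the fleet itself, rather than an incoming run, would be responsible for noticing the gap and filling it.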

### Workaround

One workaround might be (I have not tested it yet) to create the fleet, set its maximum number of instances, and then explicitly assign runs to it with `creation-policy: reuse-or-create`.

If my reading of the documentation is correct, each run would attempt to create a new instance for itself, but would only succeed if doing so did not exceed the maximum number of instances in the fleet. While this would not keep running machines at the fleet instance count, the effect would be the same: any missing instance would spin up as soon as a job attempted to `reuse-or-create` on the fleet while it had fewer instances than the maximum.
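
Here is a small Python model of how I expect the workaround to behave. This reflects my untested reading of the documentation, not actual dstack behavior, and `CappedFleet`, `reuse_or_create`, and `lose` are hypothetical names used only for this simulation.

```python
from typing import List, Optional


class CappedFleet:
    """Hypothetical model of a fleet with a maximum instance count."""

    def __init__(self, max_instances: int) -> None:
        self.max_instances = max_instances
        self.idle: List[str] = []   # instances free to take a run
        self.busy: List[str] = []   # instances currently running a job
        self._created = 0

    def reuse_or_create(self) -> Optional[str]:
        """Reuse an idle instance, else create one if under the cap."""
        if self.idle:
            instance = self.idle.pop()
        elif len(self.idle) + len(self.busy) < self.max_instances:
            self._created += 1
            instance = f"instance-{self._created}"
        else:
            return None  # at the cap: the run has to wait
        self.busy.append(instance)
        return instance

    def lose(self, instance: str) -> None:
        """Simulate the provider reclaiming an instance."""
        if instance in self.busy:
            self.busy.remove(instance)
        elif instance in self.idle:
            self.idle.remove(instance)


fleet = CappedFleet(max_instances=2)
first = fleet.reuse_or_create()        # creates instance-1
second = fleet.reuse_or_create()       # creates instance-2
third = fleet.reuse_or_create()        # None: the fleet is at its cap
fleet.lose(first)                      # the provider reclaims instance-1
replacement = fleet.reuse_or_create()  # the next run creates instance-3
```

In this model the lost instance is only replaced when the next run arrives, which is the difference from a real fleet retry: between the loss and the next run, the fleet sits below its instance count.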

### Would you like to help us implement this feature by sending a PR?

No
