[Feature]: Fleet Retry #2921

@ASmedberg-woolpert

Description

Problem

Currently, if I spin up a fleet in a volatile environment, the fleet can lose instances for reasons entirely outside of dstack's control and become unusable. The most common example is a fleet of spot instances, which the service provider can cancel at any moment, but the same applies to any scenario where failure is expected.

Keeping the fleet at its expected size is especially useful when I want to limit the number of instances in use while submitting an unbounded number of runs. As @peterschmidt85 pointed out on Discord, it would also be useful in some scenarios during fleet startup.

Solution

Add something similar (or identical) to the run::retry_on system, so that fleets are kept at their expected instance count.
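
To make the request concrete, here is a minimal Python sketch of the reconciliation step a fleet-level retry could perform. Everything in it is hypothetical: `FleetState`, `instances_to_provision`, and the field names are illustrations only and are not part of dstack's API. The bounded retry window is an assumption about how such a policy might be limited.

```python
from dataclasses import dataclass


@dataclass
class FleetState:
    """Hypothetical snapshot of a fleet; not a dstack type."""
    target_nodes: int          # instance count requested in the fleet config
    active_nodes: int          # instances currently alive
    seconds_since_loss: float  # time since the fleet first dropped below target
    retry_duration: float      # how long to keep retrying, in seconds


def instances_to_provision(state: FleetState) -> int:
    """Return how many replacement instances to request in this pass."""
    missing = max(state.target_nodes - state.active_nodes, 0)
    if missing == 0:
        return 0
    # Assumption: stop retrying once the retry window has elapsed.
    if state.seconds_since_loss > state.retry_duration:
        return 0
    return missing
```

For example, a fleet that targets 4 instances and has lost 1 to a spot interruption would request 1 replacement while still inside the retry window, and 0 once the window has passed.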

Workaround

One possible workaround (which I have not tested yet) is to create the fleet, set its maximum number of instances, and then explicitly assign runs to it with creation-policy: reuse-or-create.

If my reading of the documentation is correct, each run would attempt to create a new instance for itself but would succeed only if that did not exceed the fleet's maximum number of instances. This would not keep running machines at the fleet's instance count, but the effect would be the same: any missing instance would be spun up as soon as a job attempted to reuse-or-create on the fleet while it had fewer instances than the maximum.
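
The placement decision described above can be sketched in a few lines of Python. This models only the behavior as I understand it from the description, not verified dstack behavior; `place_job` and its parameters are hypothetical names, and preferring an idle instance before creating a new one is an assumption.

```python
def place_job(idle: int, busy: int, max_nodes: int) -> str:
    """Decide what a reuse-or-create job would do on a capped fleet.

    idle and busy count the fleet's current instances; max_nodes is the
    fleet's instance limit. All three are hypothetical inputs.
    """
    if idle > 0:
        # Assumption: an idle instance is reused before anything is created.
        return "reuse"
    if idle + busy < max_nodes:
        # The fleet is below its cap, so the job may create an instance.
        return "create"
    # The fleet is full and nothing is idle, so the job has to wait.
    return "wait"
```

With a cap of 3 and all 3 instances busy, a new job waits. If the provider then reclaims one instance, the same job sees 2 busy instances under a cap of 3 and creates a replacement, which is how the fleet would return to its intended size without a fleet-level retry.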

Would you like to help us implement this feature by sending a PR?

No
