First of all, I have to say that these are phenomenal tutorials!
But I came across the following issue.
In the written tutorial for 02, you note that checkpointing sharded optimizer states is inefficient, so the training script (train_llm.py) does not save optimizer.state_dict() (e.g., to optimizer.pt).
However, the resume logic (using state.json) still tries to load the optimizer checkpoint:
if os.path.exists("state.json"):
    optimizer.load_state_dict(torch.load("optimizer.pt"))
Since optimizer.pt is never saved, what is being loaded here?
Should this loading step be removed or updated?
Can you clarify the recommended approach for resuming runs with sharded optimizer states?
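For context, here is a minimal sketch of what I would have expected the two sides to look like if each rank checkpointed its own optimizer shard. This is just my own illustration, not the tutorial's code: the `save_checkpoint`/`load_checkpoint` helpers, the `checkpoint` directory, and the `optimizer_rank{rank}.pt` filenames are all assumptions on my part.

```python
import os
import torch
import torch.distributed as dist

def save_checkpoint(model, optimizer, directory="checkpoint"):
    # Rank 0 writes the (full) model weights; every rank writes its own
    # optimizer shard, since the optimizer state is sharded across ranks.
    rank = dist.get_rank()
    os.makedirs(directory, exist_ok=True)
    if rank == 0:
        torch.save(model.state_dict(), os.path.join(directory, "model.pt"))
    torch.save(
        optimizer.state_dict(),
        os.path.join(directory, f"optimizer_rank{rank}.pt"),
    )

def load_checkpoint(model, optimizer, directory="checkpoint"):
    # On resume, every rank loads the shared model weights and, if present,
    # the optimizer shard that it saved earlier.
    rank = dist.get_rank()
    model.load_state_dict(torch.load(os.path.join(directory, "model.pt")))
    optim_path = os.path.join(directory, f"optimizer_rank{rank}.pt")
    if os.path.exists(optim_path):
        optimizer.load_state_dict(torch.load(optim_path))
```

Is something along these lines the intended pattern, or is the recommended approach to skip optimizer state entirely on resume?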
Thanks for your work!