Inconsistency in Optimizer Checkpoint Logic in Tutorial 02 #51

@kennethSty

Description

First of all, I have to say that these are phenomenal tutorials!

But I came across the following issue: in the written tutorial for 02, you note that checkpointing sharded optimizer states is inefficient, so the training script (train_llm.py) does not save optimizer.state_dict() (e.g., to optimizer.pt).

However, the resume logic (using state.json) still tries to load the optimizer checkpoint:

if os.path.exists("state.json"):
    optimizer.load_state_dict(torch.load("optimizer.pt"))

Since optimizer.pt is never saved, what is being loaded here?
Should this loading step be removed or updated?
Can you clarify the recommended approach for resuming runs with sharded optimizer states?
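For reference, here is a minimal sketch of what I would expect a symmetric save/resume path to look like, with each rank writing and reading only its own optimizer shard. The per-rank file name and the plain torch.save/torch.load approach are my own assumptions for illustration, not code from train_llm.py:

import os
import torch
import torch.distributed as dist

def main():
    # launch with e.g. torchrun --nproc_per_node=2 this_file.py
    dist.init_process_group("gloo")  # "nccl" on GPUs
    rank = dist.get_rank()

    model = torch.nn.Linear(8, 8)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

    # hypothetical per-rank file name, not from the tutorial
    optim_path = f"optimizer_rank{rank}.pt"

    # checkpoint: each rank saves only its own optimizer shard
    torch.save(optimizer.state_dict(), optim_path)

    # resume: load the shard only if it was actually written
    if os.path.exists("state.json") and os.path.exists(optim_path):
        optimizer.load_state_dict(torch.load(optim_path))

    dist.destroy_process_group()

if __name__ == "__main__":
    main()

If per-rank shards like this are exactly what the tutorial considers inefficient, is torch.distributed.checkpoint the intended route instead? Either way, it seems the save and resume paths should agree on what actually gets written to disk.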

Thanks for your work!
