First of all, I have to say that these are phenomenal tutorials!
But I came across the following issue.
In the written tutorial for 02, you note that checkpointing sharded optimizer states is inefficient, so the training script (train_llm.py) does not save optimizer.state_dict() (e.g., to optimizer.pt).
However, the resume logic (using state.json) still tries to load the optimizer checkpoint:
if os.path.exists("state.json"):
    optimizer.load_state_dict(torch.load("optimizer.pt"))
Since optimizer.pt is never saved, what is being loaded here?
Should this loading step be removed or updated?
Can you clarify the recommended approach for resuming runs with sharded optimizer states?
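For context, here is a minimal sketch of what I would have expected the two sides to look like if each rank checkpointed its own optimizer shard. This is just my own illustration, not the tutorial's code: the `save_checkpoint`/`load_checkpoint` helpers, the `checkpoint` directory, and the `optimizer_rank{rank}.pt` filenames are all assumptions on my part.

```python
import os
import torch
import torch.distributed as dist

def save_checkpoint(model, optimizer, directory="checkpoint"):
    # Rank 0 writes the (full) model weights; every rank writes its own
    # optimizer shard, since the optimizer state is sharded across ranks.
    rank = dist.get_rank()
    os.makedirs(directory, exist_ok=True)
    if rank == 0:
        torch.save(model.state_dict(), os.path.join(directory, "model.pt"))
    torch.save(
        optimizer.state_dict(),
        os.path.join(directory, f"optimizer_rank{rank}.pt"),
    )

def load_checkpoint(model, optimizer, directory="checkpoint"):
    # On resume, every rank loads the shared model weights and, if present,
    # the optimizer shard that it saved earlier.
    rank = dist.get_rank()
    model.load_state_dict(torch.load(os.path.join(directory, "model.pt")))
    optim_path = os.path.join(directory, f"optimizer_rank{rank}.pt")
    if os.path.exists(optim_path):
        optimizer.load_state_dict(torch.load(optim_path))
```

Is something along these lines the intended pattern, or is the recommended approach to skip optimizer state entirely on resume?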
Thanks for your work!