Loss fn might be wrong

![image](https://github.com/user-attachments/assets/302b1c1e-5924-4c2b-9a32-5b74516b8ec5)
![image](https://github.com/user-attachments/assets/9e979ac8-3442-4ba2-b711-ada5a01d0509)
![image](https://github.com/user-attachments/assets/9da88f68-887a-41b7-8e07-78405d367b0e)
They use the common PPO loss update, so the algorithm is effectively maximising the reward. But the discriminator loss pushes the output for expert samples toward zero: in this case the logit (`discriminator.forward()`) will be 0 for expert behaviour and 1 for fake behaviour. The reward is the log of that logit, so it is larger when the behaviour is fake. If I read this correctly, PPO would then be rewarding the policy for looking fake rather than for looking like the expert.
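To make the sign problem concrete, here is a minimal numeric sketch. It is not the repository's code: the function names `reward_log_d` and `reward_flipped` and the sample values 0.1 and 0.9 are hypothetical, and it assumes the discriminator output lies strictly between 0 and 1, with 0 meaning expert and 1 meaning fake, as described above.

```python
import math


def reward_log_d(d: float) -> float:
    """Reward as described in this issue: the log of the discriminator output."""
    return math.log(d)


def reward_flipped(d: float) -> float:
    """One possible fix (an assumption, not the authors' method):
    negate the log so that expert-like outputs (d near 0) score higher."""
    return -math.log(d)


# Illustrative discriminator outputs, assuming 0 = expert and 1 = fake.
d_expert_like = 0.1
d_fake_like = 0.9

# With reward = log(d), the fake-looking sample gets the larger reward,
# so a reward-maximising PPO update would push the policy toward "fake".
assert reward_log_d(d_fake_like) > reward_log_d(d_expert_like)

# With the sign flipped, the expert-looking sample gets the larger reward.
assert reward_flipped(d_expert_like) > reward_flipped(d_fake_like)
```

Swapping the discriminator labels (expert = 1, fake = 0) while keeping the log reward would presumably have the same effect as flipping the sign; which of the two fits better depends on how the rest of the training code uses the discriminator output.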
