Open
Description
They use the common PPO loss update, so the algorithm maximises the reward. But the discriminator loss pushes expert samples towards zero: the "logit" (the output of discriminator.forward()) will be close to 0 for expert behaviour and close to 1 for fake (policy-generated) behaviour. The reward is the log of that output, so it is larger for fake behaviour than for expert behaviour. That means PPO is rewarded for looking less like the expert, which seems to be the opposite of what is intended.
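To make the sign problem concrete, here is a minimal sketch in plain Python. It does not use the repository's code: the function names `reward_log_d` and `reward_neg_log_d` and the values 0.1 and 0.9 are hypothetical, chosen only to stand in for a discriminator output near 0 (expert) and near 1 (fake). It assumes the discriminator output is a probability in (0, 1).

```python
import math

def reward_log_d(d):
    # Reward as described in this issue: log of the discriminator output.
    return math.log(d)

def reward_neg_log_d(d):
    # One possible fix (an assumption, not the repository's code): flip the sign.
    return -math.log(d)

# Hypothetical discriminator outputs under the convention described above:
# close to 0 for expert behaviour, close to 1 for fake behaviour.
d_expert = 0.1
d_fake = 0.9

# With log(D), the fake sample gets the larger reward.
print("log(D):  expert", reward_log_d(d_expert), "fake", reward_log_d(d_fake))

# With -log(D), the expert-like sample gets the larger reward.
print("-log(D): expert", reward_neg_log_d(d_expert), "fake", reward_neg_log_d(d_fake))
```

Under this labelling, a reward of `-log(D)` (or, alternatively, swapping the discriminator labels) would make maximising the reward pull the policy towards the expert rather than away from it.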