Description
Hi!
I have a small question about the "Attention module" part of your code.
Before passing through the final attention linear layer, tanh is used for the non-linearity rather than ReLU.
And the output of the final attention layer is multiplied by "flattened.shape[1]**-0.5".
Is there a special reason for using tanh instead of ReLU?
And why is the result multiplied by that factor?
The original code is below.
att = self.f_att(torch.tanh(att_enc+att_dec))*flattened.shape[1]**-0.5 # att.shape = (batch, locations, 1)
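For context, here is a minimal sketch of an additive (Bahdanau-style) attention block that is consistent with that line. The layer names f_enc/f_dec/f_att, the dimensions, and the shapes of "flattened" and "hidden" are my assumptions for illustration, not the repository's actual code. The scaling factor resembles the 1/sqrt(d) term in scaled dot-product attention, which keeps the logits from growing with size before the softmax, but that is only my guess at the rationale.

```python
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    """Sketch of additive attention with tanh and a 1/sqrt(locations) scale.

    Assumed shapes (not taken from the original repo):
      flattened: (batch, locations, enc_dim)  # flattened encoder feature map
      hidden:    (batch, dec_dim)             # decoder hidden state
    """
    def __init__(self, enc_dim, dec_dim, att_dim):
        super().__init__()
        self.f_enc = nn.Linear(enc_dim, att_dim)  # projects encoder features
        self.f_dec = nn.Linear(dec_dim, att_dim)  # projects decoder state
        self.f_att = nn.Linear(att_dim, 1)        # final attention linear layer

    def forward(self, flattened, hidden):
        att_enc = self.f_enc(flattened)            # (batch, locations, att_dim)
        att_dec = self.f_dec(hidden).unsqueeze(1)  # (batch, 1, att_dim)
        # tanh bounds the pre-score activations in [-1, 1]; the factor
        # flattened.shape[1]**-0.5 rescales the logits by 1/sqrt(locations).
        att = self.f_att(torch.tanh(att_enc + att_dec)) * flattened.shape[1] ** -0.5
        weights = torch.softmax(att, dim=1)        # (batch, locations, 1)
        context = (weights * flattened).sum(dim=1) # (batch, enc_dim)
        return context, weights
```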