varun_damn

joined 2 years ago

The GPT-3 Architecture, on a Napkin in c/[email protected]

[–] [email protected] 1 points 1 year ago

@behohippy @saint Instead of modeling the sequence timestep by timestep, attention lets us feed the whole sequence through the network in parallel, much like a fully connected layer. The positional encoding tells the model where each token sits in the sequence, and we can drop the keys that receive low attention values...
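To make that concrete, here is a rough sketch in plain Python, not anything taken from the GPT-3 paper or the linked post. It assumes sinusoidal positional encodings, a single attention head with Q = K = V (no learned projections), and a hypothetical `keep_top_k` parameter standing in for "dropping keys with low attention"; all function names are made up for illustration.

```python
import math

def positional_encoding(seq_len, d_model):
    """Sinusoidal position vectors: sin on even dims, cos on odd dims."""
    pe = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model):
            angle = pos / (10000 ** (2 * (i // 2) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        pe.append(row)
    return pe

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x, keep_top_k=None):
    """Single-head attention with Q = K = V = x (a simplifying assumption)."""
    d = len(x[0])
    out, weights = [], []
    # Each query row depends only on x, not on the previous row's result,
    # so these iterations could all run in parallel (unlike an RNN step).
    for q in x:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in x]
        w = softmax(scores)
        if keep_top_k is not None:
            # Hypothetical pruning: keep the k largest weights, zero the rest,
            # then renormalise so the row still sums to 1.
            keep = set(sorted(range(len(w)), key=lambda j: w[j], reverse=True)[:keep_top_k])
            w = [w[j] if j in keep else 0.0 for j in range(len(w))]
            total = sum(w)
            w = [v / total for v in w]
        weights.append(w)
        out.append([sum(w[j] * x[j][c] for j in range(len(x))) for c in range(d)])
    return out, weights

# Toy input: tokens 0 and 2 are identical; only the position vectors differ.
tokens = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0]]
pe = positional_encoding(len(tokens), 4)
x = [[t + p for t, p in zip(tok, pos)] for tok, pos in zip(tokens, pe)]
out, weights = self_attention(x, keep_top_k=2)
```

Without the position vectors, rows 0 and 2 of `x` would be indistinguishable to the attention step, which is the point about the positional encoding carrying the order. The top-k pruning is just one simple way to read "remove the keys"; real models that sparsify attention do it in various other ways.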
