Note on The GPT-3 Architecture, on a Napkin via dugas.ch

To encode the position of the current token in the sequence, the authors take the token’s position (a scalar i in the range [0, 2047]) and pass it through 12288 sinusoidal functions, each with a different frequency.
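
As a rough illustration of the idea, here is a minimal sketch that turns one position into a vector of sinusoid values. It assumes the scheme from the original Transformer paper ("Attention Is All You Need"), where even dimensions use sine, odd dimensions use cosine, and each sine/cosine pair shares one frequency; the article may have a slightly different layout in mind. The function name `sinusoidal_encoding` and the `base` constant of 10000 are assumptions made for this example, not details taken from the article.

```python
import math
from typing import List


def sinusoidal_encoding(position: int, d_model: int = 12288, base: float = 10000.0) -> List[float]:
    """Return a d_model-long sinusoidal encoding of a single token position.

    Assumed scheme: dimensions 2j and 2j+1 share the frequency
    1 / base**(2j / d_model); the even one takes sin, the odd one cos.
    """
    encoding = []
    for k in range(d_model):
        # k - k % 2 rounds k down to the even index of its (sin, cos) pair.
        frequency = 1.0 / (base ** ((k - k % 2) / d_model))
        angle = position * frequency
        encoding.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
    return encoding


# Small d_model so the result is easy to inspect by hand.
vec = sinusoidal_encoding(position=5, d_model=8)
```

Calling `sinusoidal_encoding(i)` with the default `d_model` gives one 12288-long vector for position i; stacking the vectors for i in [0, 2047] would give the full table of position encodings.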

Exactly why this works is not entirely clear to me. The authors explain that it yields many relative-position encodings, which is useful for the model. Other possible mental models for analyzing this choice: consider the way signals are often represented as sums of periodic samples (see Fourier transforms, or the SIREN network architecture), or the possibility that language naturally presents cycles of various lengths (for example, poetry).
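
One way to make the "relative-position" explanation concrete: for a single frequency, the (sin, cos) pair at position i + k can be computed from the pair at position i using only the offset k, through the angle-addition identities. The sketch below checks this numerically. The helper `rotate_pair` and the chosen frequency are hypothetical and exist only for this illustration.

```python
import math


def rotate_pair(sin_i: float, cos_i: float, omega: float, k: int):
    """Shift a (sin, cos) pair forward by k positions.

    Uses only the offset k and the frequency omega, never the absolute position:
      sin(w(i+k)) = sin(wi)cos(wk) + cos(wi)sin(wk)
      cos(w(i+k)) = cos(wi)cos(wk) - sin(wi)sin(wk)
    """
    sin_shift = sin_i * math.cos(omega * k) + cos_i * math.sin(omega * k)
    cos_shift = cos_i * math.cos(omega * k) - sin_i * math.sin(omega * k)
    return sin_shift, cos_shift


omega = 1.0 / 10000 ** (4 / 12288)  # one arbitrary frequency, chosen for illustration
i, k = 100, 7

shifted = rotate_pair(math.sin(omega * i), math.cos(omega * i), omega, k)
direct = (math.sin(omega * (i + k)), math.cos(omega * (i + k)))

# True when the shifted pair matches the directly computed pair.
matches = all(math.isclose(a, b, abs_tol=1e-12) for a, b in zip(shifted, direct))
```

Because the shift depends only on k, a linear layer could in principle learn to compare tokens by their distance apart, which is one plausible reading of why these encodings help.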


language, llm
2023, April 26, Wednesday
Permalink to 2023.NTE.426