Another key distinction from Transformers is that the attention scores A(X) in the HSTU (Hierarchical Sequential Transduction Units) block are not normalized with softmax. We’ve seen this in DIN, and the reasoning here is the same: we want the model to be sensitive to the intensity of inputs, that is, the total count of user actions, not just their relative frequencies. The underlying reason is that, unlike in LLMs, the corpus of tokens (i.e. ids) is not stationary but rapidly evolving, with new tokens constantly being introduced and old tokens vanishing from the corpus. Softmax does not work well when the space it normalizes over is constantly changing.
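To make the contrast concrete, here is a minimal sketch (not Meta's implementation) of softmax-normalized attention versus HSTU-style pointwise attention. It assumes SiLU as the pointwise nonlinearity and a simple length-based scaling; the function names and shapes are illustrative.

```python
# Minimal sketch contrasting softmax-normalized attention with
# HSTU-style unnormalized (pointwise) attention. Illustrative only.
import torch
import torch.nn.functional as F

def softmax_attention(q, k, v):
    # Standard Transformer attention: each query's weights sum to 1,
    # so only the relative weighting of past actions survives.
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return F.softmax(scores, dim=-1) @ v

def pointwise_attention(q, k, v):
    # HSTU-style: a pointwise nonlinearity (SiLU assumed here) replaces softmax.
    # Each row's weights are not forced to sum to 1, so the output magnitude
    # still reflects how many actions attended strongly (intensity),
    # not just their relative proportions.
    n = q.shape[-2]
    scores = F.silu(q @ k.transpose(-2, -1)) / n  # scale by sequence length
    return scores @ v

q = k = v = torch.randn(1, 8, 16)  # (batch, sequence length, dim)
print(softmax_attention(q, k, v).shape, pointwise_attention(q, k, v).shape)
```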
Samuel Flender, User Action Sequence Modeling: From Attention to Transformers and Beyond