Another key distinction from Transformers is that the attention scores A(X) in the HSTU (Hierarchical Sequential Transduction Units) block are not normalized with softmax. We’ve seen this in DIN, and the reasoning here is the same: we want the model to be sensitive to the intensity of inputs, that is, the total count of user actions, not just their relative frequencies. The underlying reason is that, unlike in LLMs, the corpus of tokens (i.e. ids) is not stationary but rapidly evolving, with new tokens constantly being introduced and old tokens vanishing from the corpus. Softmax does not work well when the space it normalizes over is constantly changing.
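To make the contrast concrete, here is a minimal sketch (not Meta's implementation) of softmax-normalized attention versus HSTU-style pointwise attention. It assumes SiLU as the pointwise nonlinearity and a simple length-based scaling; the function names and shapes are illustrative.

```python
# Minimal sketch contrasting softmax-normalized attention with
# HSTU-style unnormalized (pointwise) attention. Illustrative only.
import torch
import torch.nn.functional as F

def softmax_attention(q, k, v):
    # Standard Transformer attention: each query's weights sum to 1,
    # so only the relative weighting of past actions survives.
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return F.softmax(scores, dim=-1) @ v

def pointwise_attention(q, k, v):
    # HSTU-style: a pointwise nonlinearity (SiLU assumed here) replaces softmax.
    # Each row's weights are not forced to sum to 1, so the output magnitude
    # still reflects how many actions attended strongly (intensity),
    # not just their relative proportions.
    n = q.shape[-2]
    scores = F.silu(q @ k.transpose(-2, -1)) / n  # scale by sequence length
    return scores @ v

q = k = v = torch.randn(1, 8, 16)  # (batch, sequence length, dim)
print(softmax_attention(q, k, v).shape, pointwise_attention(q, k, v).shape)
```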
Samuel Flender, User Action Sequence Modeling: From Attention to Transformers and Beyond