Note on You Should Probably Pay Attention to Tokenizers via Cybernetist

A little bit of cleaning of input text (you noticed the effect some empty space characters had on embeddings) might go a long way: standardise the format of your dates so they’re consistent throughout your embeddings; remove trailing spaces wherever you can - you saw the effect they had on the embeddings; the same goes for any other numerical data like prices in different currencies, etc.

FROM:
Cybernetist
You Should Probably Pay Attention to Tokenizers
Source

Might want to think about speculation strategies for tokenization preprocessing for LLMs.
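
A minimal sketch of the cleaning steps from the quote (consistent dates, no stray whitespace), to run before text reaches a tokenizer or embedding model. It uses only the Python standard library. The names `DATE_FORMATS`, `DATE_PATTERN`, `normalize_date`, and `clean_text` are hypothetical, and the choice of ISO 8601 as the target format and day-first parsing of slash dates are assumptions, not something the source prescribes.

```python
import re
from datetime import datetime

# Assumed input formats; adjust to whatever actually appears in your data.
# Slash dates are read as day/month/year here, which is an assumption.
DATE_FORMATS = ("%d/%m/%Y", "%B %d, %Y", "%Y-%m-%d")

# Candidate date strings: 24/10/2024, October 24, 2024, 2024-10-24
DATE_PATTERN = re.compile(
    r"\b(\d{1,2}/\d{1,2}/\d{4}|[A-Z][a-z]+ \d{1,2}, \d{4}|\d{4}-\d{2}-\d{2})\b"
)


def normalize_date(match):
    """Rewrite a matched date as YYYY-MM-DD, or leave it alone if unparseable."""
    raw = match.group(0)
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return raw


def clean_text(text):
    """Standardise dates, collapse runs of spaces/tabs, strip line edges."""
    text = DATE_PATTERN.sub(normalize_date, text)
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines()]
    return "\n".join(lines)


if __name__ == "__main__":
    print(repr(clean_text("Paid on 24/10/2024   ")))
```

The same pattern (match, parse, re-emit in one canonical form) could be extended to prices and currencies, though that would need its own rules.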

Reference

Notes
llm, data
You Should Probably Pay Attention to Tokenizers
Cybernetist
2024, October 24, Thursday