Note on You Should Probably Pay Attention to Tokenizers via Cybernetist
A little cleaning of input text can go a long way (you noticed the effect some stray whitespace characters had on embeddings): standardise the format of your dates so they're consistent throughout your corpus; strip trailing spaces wherever you can; and do the same for other numerical data, like prices in different currencies.
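A minimal sketch of that kind of cleanup, assuming a hypothetical `normalize_text` helper and a DD/MM/YYYY date convention in the input; the exact rules would depend on your own data:

```python
import re

def normalize_text(text: str) -> str:
    """Light cleanup before tokenization (illustrative, not exhaustive)."""
    # Strip trailing spaces on each line -- stray spaces can shift token boundaries.
    text = "\n".join(line.rstrip() for line in text.splitlines())
    # Collapse runs of spaces/tabs to a single space.
    text = re.sub(r"[ \t]{2,}", " ", text)
    # Standardise DD/MM/YYYY dates to ISO YYYY-MM-DD so each date has one form.
    text = re.sub(r"\b(\d{2})/(\d{2})/(\d{4})\b", r"\3-\2-\1", text)
    return text

print(normalize_text("Invoice  dated 01/02/2024   \nTotal: $10 "))
```

The same pattern extends to currencies and other numeric formats: pick one canonical representation and rewrite everything into it before the text ever reaches the tokenizer.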
Might want to think about speculation strategies for tokenization preprocessing for LLMs.
Reference
- Permalink: 2024.NTE.200
- In: Notes
- Tagged: llm, data
- From: You Should Probably Pay Attention to Tokenizers