A little cleaning of the input text can go a long way. Standardise the format of your dates so they're consistent throughout your embeddings, and remove trailing spaces wherever you can (you saw the effect a few empty space characters had on the embeddings). The same goes for other numerical data, such as prices in different currencies.
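As a rough illustration of that kind of cleanup, here is a minimal Python sketch that could run before text is sent to an embedding model. Everything in it is an assumption made for the example rather than something from the original article: the function name `clean_text`, the three date formats (read day-first and with English month names), ISO 8601 as the target date format, and the small currency-symbol table.

```python
import re
from datetime import datetime

# Hypothetical date formats assumed for this sketch (day-first for slashes).
DATE_FORMATS = ("%d/%m/%Y", "%d %B %Y", "%B %d, %Y")

DATE_PATTERN = re.compile(
    r"\b\d{1,2}/\d{1,2}/\d{4}\b"         # 03/02/2024
    r"|\b\d{1,2} [A-Z][a-z]+ \d{4}\b"    # 3 February 2024
    r"|\b[A-Z][a-z]+ \d{1,2}, \d{4}\b"   # February 3, 2024
)

# Hypothetical symbol-to-code table; extend it for your own data.
CURRENCY_CODES = {"$": "USD", "€": "EUR", "£": "GBP"}

PRICE_PATTERN = re.compile(r"([$€£])\s?(\d+(?:,\d{3})*(?:\.\d+)?)")


def _date_to_iso(match):
    """Rewrite a matched date as YYYY-MM-DD, or leave it alone if unparseable."""
    raw = match.group(0)
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw, fmt).date().isoformat()
        except ValueError:
            continue
    return raw


def _price_to_code(match):
    """Rewrite '$1,200.50' as '1200.50 USD' so all prices share one shape."""
    symbol, amount = match.groups()
    return f"{amount.replace(',', '')} {CURRENCY_CODES[symbol]}"


def clean_text(text):
    """Normalise dates, prices, and whitespace before embedding."""
    text = DATE_PATTERN.sub(_date_to_iso, text)
    text = PRICE_PATTERN.sub(_price_to_code, text)
    # Collapse runs of spaces/tabs and drop leading/trailing whitespace per line.
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines()]
    return "\n".join(lines).strip()


if __name__ == "__main__":
    print(clean_text("Invoice dated 3 February 2024   \nTotal: $1,200.50 \t"))
```

Whatever rules you settle on, apply the same function to the documents you index and to the queries you embed, so both sides are normalised identically.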

Might want to think about speculation strategies for tokenization preprocessing for LLMs.


www.joshbeckman.org/notes/802725009