A little bit of cleaning of input text (you noticed the effe...

A little cleaning of the input text might go a long way (you noticed the effect some empty space characters had on embeddings): standardise the format of your dates so they’re consistent throughout your embeddings; do the same for any other numerical data, such as prices in different currencies; and remove trailing spaces wherever you can.
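As a rough illustration of that kind of cleaning, here is a minimal Python sketch using only the standard library. Everything named in it is an assumption made for the example rather than something from the quoted source: the function `clean_for_embedding`, the list of accepted date layouts, the choice to read `03/02/2024` as day-first, the ISO 8601 output format, and the currency symbol table.

```python
import re
from datetime import datetime

# Assumed input date layouts (day-first for slashed dates); extend for your data.
# Note: %B matches month names in the current locale (English by default).
DATE_FORMATS = ("%d/%m/%Y", "%d %B %Y", "%B %d, %Y", "%Y-%m-%d")

DATE_PATTERN = re.compile(
    r"\b(\d{1,2}/\d{1,2}/\d{4}"      # 03/02/2024
    r"|\d{1,2} [A-Z][a-z]+ \d{4}"    # 3 February 2024
    r"|[A-Z][a-z]+ \d{1,2}, \d{4}"   # February 3, 2024
    r"|\d{4}-\d{2}-\d{2})\b"         # 2024-02-03
)

# Hypothetical symbol-to-code table; add whatever currencies you actually see.
CURRENCY_CODES = {"$": "USD", "€": "EUR", "£": "GBP"}
PRICE_PATTERN = re.compile(r"([$€£])\s?(\d+(?:,\d{3})*(?:\.\d+)?)")


def _normalise_date(match):
    """Rewrite a matched date as ISO 8601, or leave it alone if parsing fails."""
    raw = match.group(0)
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw, fmt).date().isoformat()
        except ValueError:
            continue
    return raw


def _normalise_price(match):
    """Rewrite '$1,200' as '1200.00 USD' so every price has the same shape."""
    symbol, amount = match.groups()
    value = float(amount.replace(",", ""))
    return f"{value:.2f} {CURRENCY_CODES[symbol]}"


def clean_for_embedding(text):
    """Normalise dates, prices, and whitespace before embedding the text."""
    text = DATE_PATTERN.sub(_normalise_date, text)
    text = PRICE_PATTERN.sub(_normalise_price, text)
    text = re.sub(r"[ \t]+", " ", text)  # collapse runs of spaces and tabs
    return text.strip()                  # drop leading and trailing whitespace


print(clean_for_embedding("Paid $1,200 on 03/02/2024  "))
# prints: Paid 1200.00 USD on 2024-02-03
```

The point is less the specific output format than applying the same function to both the documents you index and the queries you embed, so the two sides stay consistent.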

Might want to think about speculation strategies for tokenization preprocessing for LLMs.
