A little bit of cleaning of input text might go a long way: standardise the format of your dates so they’re consistent throughout your embeddings; remove trailing spaces wherever you can (you saw the effect some empty space characters had on the embeddings); and do the same for any other numerical data, like prices in different currencies.
FROM: Cybernetist, “You Should Probably Pay Attention to Tokenizers”
Might want to think about speculation strategies for tokenization preprocessing for LLMs.
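A rough sketch of what that kind of cleanup could look like before text goes to a tokenizer, using only the Python standard library. The function name `clean_text`, the list of accepted date formats, and the choice of ISO 8601 as the target format are all my own assumptions, not something the quoted article prescribes. Slash dates are assumed to be day-first, which will be wrong for US-style data.

```python
import re
from datetime import datetime

# Assumed input formats (hypothetical; extend for your own data).
# Slash dates are treated as day/month/year.
_DATE_FORMATS = ("%d/%m/%Y", "%B %d, %Y", "%d %B %Y")

# Candidate date spans: 05/03/2024, March 5, 2024, 5 March 2024
_DATE_PATTERN = re.compile(
    r"\b(\d{1,2}/\d{1,2}/\d{4}"
    r"|[A-Z][a-z]+ \d{1,2}, \d{4}"
    r"|\d{1,2} [A-Z][a-z]+ \d{4})\b"
)


def _to_iso(match):
    """Rewrite a matched date as YYYY-MM-DD, or leave it alone if unparseable."""
    raw = match.group(0)
    for fmt in _DATE_FORMATS:
        try:
            return datetime.strptime(raw, fmt).date().isoformat()
        except ValueError:
            continue
    return raw


def clean_text(text):
    """Standardise dates, collapse space/tab runs, strip each line's edges."""
    text = _DATE_PATTERN.sub(_to_iso, text)
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines()]
    return "\n".join(lines)


if __name__ == "__main__":
    print(clean_text("Paid on 05/03/2024   \nDue March 5, 2024\t "))
```

Prices and currencies are not handled here; they would need their own rules, since deciding on a canonical currency format depends on the data.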
Josh Beckman