In s1, when the LLM tries to stop thinking by emitting "</think>", the authors force it to keep going by replacing that token with "Wait". The model then begins to second-guess and double-check its answer. They do this to trim or extend thinking time (trimming is just abruptly inserting "</think>").
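A minimal sketch of the idea, using a toy stand-in for the model (the real method intercepts tokens during LLM decoding; `model_step`, `min_tokens`, and `max_tokens` here are hypothetical names for illustration):

```python
def budget_force(model_step, prompt, min_tokens=6, max_tokens=32):
    """Budget forcing: suppress "</think>" until min_tokens is reached by
    substituting "Wait" (extend), and cut off at max_tokens by inserting
    "</think>" (trim)."""
    tokens = []
    state = list(prompt)
    while len(tokens) < max_tokens:
        tok = model_step(state)
        if tok == "</think>":
            if len(tokens) < min_tokens:
                tok = "Wait"  # force more thinking; model second-guesses itself
            else:
                tokens.append(tok)  # allowed to stop now
                break
        tokens.append(tok)
        state.append(tok)
    else:
        tokens.append("</think>")  # budget exhausted: abruptly end thinking
    return tokens
```

With a toy model that tries to stop early, the first "</think>" gets swapped for "Wait" and generation continues until the minimum budget is met.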

It’s really dumb, I love it.

I did this to myself today: I typed out a full response to a colleague, then stepped back and forced myself to rethink it.

There are so many simple tricks still to be discovered with LLMs: here, an example of SFT (supervised fine-tuning) over RLHF (reinforcement learning from human feedback).
