In s1, when the LLM tries to stop thinking by emitting "</think>", they force it to keep going by replacing that token with "Wait". The model then begins to second-guess and double-check its answer. They do this to trim or extend thinking time (trimming is just abruptly inserting "</think>" to close the reasoning early).
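Here's a minimal sketch of what that budget-forcing loop might look like. It assumes a hypothetical `generate_next_token` streaming function and token-count budgets I made up for illustration; it is not the paper's actual code, just the shape of the trick.

```python
END_THINK = "</think>"
WAIT = "Wait"

def generate_with_budget(prompt, generate_next_token, min_tokens=512, max_tokens=4096):
    """Extend thinking by swapping </think> for 'Wait'; trim by forcing </think>.

    generate_next_token(text) -> str is a hypothetical streaming API that
    returns the model's next token given everything generated so far.
    """
    output = []
    while True:
        token = generate_next_token(prompt + "".join(output))

        if token == END_THINK and len(output) < min_tokens:
            # Model tried to stop thinking too early: replace the end marker
            # with "Wait" so it second-guesses and keeps reasoning.
            output.append(WAIT)
            continue

        if len(output) >= max_tokens:
            # Budget exhausted: trim by abruptly closing the thinking block.
            output.append(END_THINK)
            break

        output.append(token)
        if token == END_THINK:
            break

    return "".join(output)
```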
It’s really dumb, I love it.
I did this to myself today by typing out a full response to a colleague then stepping back and forcing myself to rethink it.
There are so many simple tricks still to be discovered with LLMs: here, an example of SFT (supervised fine-tuning) winning out over RLHF (reinforcement learning from human feedback).
Josh Beckman