Note on S1: The $6 R1 Competitor? via Tim Kellogg
In s1, when the LLM tries to stop thinking by emitting "</think>", they force it to keep going by replacing that token with "Wait". It'll then begin to second-guess and double-check its answer. They do this to trim or extend thinking time (trimming is just abruptly inserting "</think>"). It's really dumb, I love it.
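A minimal sketch of the extend side of the trick, assuming a `generate(text, stop=...)` helper that samples a continuation from an LLM and halts at a stop string (the helper and the `min_waits` knob are my own illustration, not the s1 code):

```python
from typing import Callable

# Assumed helper: generate(text, stop=...) samples a continuation from an
# LLM and returns it, halting just before it would emit the stop string.
GenerateFn = Callable[..., str]

def budget_force(generate: GenerateFn, prompt: str, min_waits: int = 2) -> str:
    """Extend thinking s1-style: swap each attempted "</think>" for "Wait"."""
    text = prompt + "<think>"
    for _ in range(min_waits):
        # The model tried to close its reasoning; override and keep it going.
        text += generate(text, stop="</think>") + "Wait"
    # Let it finish, then close the thinking block ourselves. Trimming is the
    # degenerate case: skip the loop and insert "</think>" as soon as a token
    # budget is hit.
    text += generate(text, stop="</think>") + "</think>"
    return text  # the final answer is generated from here as usual
```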
I did this to myself today by typing out a full response to a colleague, then stepping back and forcing myself to rethink it.
There are so many simple tricks still to be discovered with LLMs: here, an example of plain SFT (supervised fine-tuning) winning out over RLHF (reinforcement learning from human feedback).