When we build a preference dataset, what we should actually be asking is, “Is a world with a model trained on this dataset preferable to a world with a model trained on that dataset?” Of course, this is an intractable question to ask, because doing so would require somehow collecting human labels on every possible arrangement of a training dataset, leading to a combinatorial explosion of options. Instead, we approximate this by collecting human preference signals on each individual data point. But there’s a mismatch: just because humans prefer a more detailed image in one instance doesn’t mean that we’d prefer a world where every single image was maximally detailed.
FROM: thesephist.com, “Epistemic Calibration and Searching the Space of Truth”
Preference tuning pushes models away from being accurate reflections of reality. When we ask a human labeler to choose between one output and another, that choice is a poor proxy for the thing we actually want: an assessment of the overall direction the model is being pushed toward.
Josh Beckman
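
For concreteness, the per-datapoint approximation the excerpt describes is roughly what a standard pairwise reward-model objective encodes. Below is a minimal sketch (assuming PyTorch and a Bradley-Terry-style loss; the reward values are made up for illustration): the objective only ever sees isolated comparisons, never the dataset-level question of which trained model we would prefer.

```python
import torch
import torch.nn.functional as F

def pairwise_preference_loss(reward_chosen: torch.Tensor,
                             reward_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry-style loss over individual comparisons.

    Each term scores one isolated comparison ("A beat B"). Nothing in the
    objective asks whether a whole dataset of such wins produces a model
    we'd prefer overall.
    """
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Hypothetical rewards a reward model might assign to chosen vs. rejected outputs.
chosen = torch.tensor([1.2, 0.4, 2.0])
rejected = torch.tensor([0.3, 0.9, 1.1])
print(pairwise_preference_loss(chosen, rejected))
```

The mismatch is visible in the objective itself: it aggregates independent pairwise wins, so “more detailed won this comparison” compounds toward “maximally detailed everywhere,” with no term that checks the world-level preference.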