There are some whitepapers showing that LLMs predict the results of social science survey experiments, with (r = 0.85, adj. r = 0.91) across 70 studies and (r = 0.90, adj. r = 0.94) for the unpublished studies. If these weren’t surveys I would be highly suspicious, because this would imply the results could reliably replicate at all. If it’s only surveys, sure, I suppose surveys should replicate.
This suggests that we mostly do not actually need the surveys: we can get close (r ~ 0.9) by asking GPT-4, and that will only improve over time.
This also suggests we can use it to measure question bias and framing effects. You can simulate five different versions of the survey and see how the simulated responses change. This could also let one do a meta-analysis of sorts, with ‘wording-adjusted’ results.
TheZvi, AI #76: Six Short Stories About OpenAI
I would expect this to work only for topics that people have expressed opinions about in the corpus/training text. But still! The bias calibration alone is very useful.
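A minimal sketch of what that framing check could look like, assuming the OpenAI Python SDK; the model name, question wordings, respondent prompt, and sample size below are all illustrative placeholders rather than anything from the cited studies:

```python
# Rough sketch: simulate the same survey item under several framings and
# compare the simulated answer distributions. Everything below (model name,
# framings, respondent prompt, sample size) is an illustrative assumption.
from collections import Counter
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

FRAMINGS = [  # hypothetical wordings of the "same" question
    "Do you support increasing funding for public transit?",
    "Do you support raising taxes to expand public transit?",
    "Should your city invest more in buses and trains?",
]

def simulate_responses(question: str, n: int = 50) -> Counter:
    """Role-play n simulated respondents and tally their yes/no answers."""
    tally = Counter()
    for _ in range(n):
        reply = client.chat.completions.create(
            model="gpt-4o",    # placeholder model
            temperature=1.0,   # sample varied simulated respondents
            messages=[
                {"role": "system",
                 "content": "You are a randomly selected US adult answering a "
                            "one-question survey. Answer with Yes or No only."},
                {"role": "user", "content": question},
            ],
        )
        answer = reply.choices[0].message.content.strip().lower().rstrip(".")
        tally[answer] += 1
    return tally

if __name__ == "__main__":
    # If the distributions shift noticeably across framings, the wording itself
    # is doing work; that gap is the bias-calibration signal to adjust for.
    for framing in FRAMINGS:
        print(framing, dict(simulate_responses(framing)))
```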
Josh Beckman