New paper with Stewart Slocum and Dylan Hadfield-Menell is out! We find that alignment algorithms like RLHF and DPO significantly reduce the diversity of LLM outputs during post-training, and that the KL-divergence regularizer causes models to systematically overweight majority opinions at the expense of diversity. (paper)


We introduce Soft Preference Learning, which decouples the entropy and cross-entropy components of the KL penalty to enable finer control over output diversity. Models trained with this approach achieve higher accuracy on difficult repeated-sampling tasks and exhibit greater semantic and lexical diversity. They also represent a wider range of societal viewpoints and show improved logit calibration.
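
To make the decoupling concrete, here is a minimal sketch (not the paper's implementation) of a per-token KL-style penalty whose entropy and cross-entropy terms receive separate weights. The coefficient names `alpha` and `beta` are illustrative assumptions; setting both to 1 recovers the standard KL term used in RLHF/DPO-style objectives.

```python
import torch
import torch.nn.functional as F

def decoupled_kl_penalty(logits, ref_logits, alpha=1.0, beta=1.0):
    """Per-token KL-style penalty with separately weighted terms.

    Uses the identity KL(pi || pi_ref) = -H(pi) + CE(pi, pi_ref).
    Weighting the two terms independently (alpha, beta are
    illustrative knobs) gives finer control than the single KL
    coefficient; alpha = beta = 1 recovers the standard KL penalty.
    """
    logp = F.log_softmax(logits, dim=-1)          # log pi(y | x)
    ref_logp = F.log_softmax(ref_logits, dim=-1)  # log pi_ref(y | x)
    p = logp.exp()
    neg_entropy = (p * logp).sum(dim=-1)          # -H(pi)
    cross_entropy = -(p * ref_logp).sum(dim=-1)   # H(pi, pi_ref)
    return alpha * neg_entropy + beta * cross_entropy
```

The specific weighting scheme Soft Preference Learning uses is described in the paper; the sketch above only illustrates the entropy/cross-entropy split of the KL penalty.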


Accepted to ICLR 2025.