DPO (Direct Preference Optimization)

ML technique that optimizes a model directly on human preference data (e.g., pairwise comparisons) rather than on a hand-specified objective or a separately trained reward model.

Direct Preference Optimization (DPO) is a machine learning approach in which model optimization is guided directly by human preferences or choices rather than by a hand-specified metric or a separately learned reward signal. This is particularly useful when an explicit objective function is hard to define, such as for aesthetic or otherwise subjective judgments (e.g., preferences over music or visual art). DPO typically works from pairs of candidate outputs, one of which a person has marked as preferred; the model is trained to assign a higher relative likelihood or score to the preferred option in each pair, so these comparative judgments directly shape the learning process. This aligns the model more closely with human decision-making and helps tailor systems that are sensitive to nuanced human judgments.
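In the best-known formulation, introduced for aligning language models, each training example pairs a preferred ("chosen") response with a rejected one, and the loss rewards the trained policy for assigning a higher likelihood ratio, relative to a frozen reference model, to the chosen response. The sketch below illustrates that loss in PyTorch; it assumes the per-response log-probabilities have already been computed, and the function and argument names are illustrative rather than taken from any particular library.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Minimal sketch of a DPO-style loss over a batch of preference pairs.

    Each argument is a tensor of summed log-probabilities of a response
    under either the policy being trained or a frozen reference model.
    """
    # Implicit "rewards": scaled log-probability ratios vs. the reference model.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)

    # Binary classification-style loss: increase the margin between the
    # chosen and rejected responses, raising the preferred response's
    # relative likelihood under the policy.
    loss = -F.logsigmoid(chosen_rewards - rejected_rewards)
    return loss.mean()
```

In this formulation the scalar `beta` controls how far the trained policy may drift from the reference model: larger values reward bigger likelihood margins, while smaller values keep the policy closer to its starting point.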

The concept of optimizing models from direct preference judgments has been explored in various forms over the years, and it gained traction in the 2010s as part of the broader exploration of interactive, human-in-the-loop machine learning and learning from human feedback. DPO itself was formalized in 2023 as a simpler alternative to the reward-model-plus-reinforcement-learning pipeline used in RLHF.

DPO as a specific algorithm was introduced by Rafailov et al. in the 2023 paper "Direct Preference Optimization: Your Language Model is Secretly a Reward Model," which showed that the preference-alignment objective used in RLHF can be optimized directly, without fitting a separate reward model. The method builds on foundational work in preference learning, interactive machine learning, and learning from human feedback, advanced by numerous researchers across the AI community.
