Introducing DPO-LLaMA
A clean PyTorch implementation of Direct Preference Optimization (DPO) to fine-tune LLaMA models
Reinforcement learning from human feedback (RLHF) [1] plays a key role in the success of ChatGPT. It typically involves training a reward model on a dataset of human preferences to estimate a reward score, and then fine-tuning the model with an RL algorithm such as PPO to maximize this estimated reward. However, RLHF with LLMs requires a large amount of computation: only a small portion goes into training the reward model, while the bulk is spent on the RL stage. Integrating RL with LLMs is also quite challenging, since RL is a complex field in its own right. If you're interested in RLHF, please check my post on InstructLLaMA.
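For context, the RL stage commonly maximizes the learned reward under a KL penalty that keeps the policy close to a reference (typically the SFT) model; in standard notation, one hedged way to write this objective is:

$$\max_{\pi_\theta}\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\big[r_\phi(x, y)\big] \;-\; \beta\, \mathbb{D}_{\mathrm{KL}}\big[\pi_\theta(y \mid x)\,\|\,\pi_{\mathrm{ref}}(y \mid x)\big]$$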
A new method called Direct Preference Optimization (DPO) [2] suggests we might be able to achieve the same performance while drastically reducing the computation and complexity. Through a series of mathematical derivations, the authors of DPO show that the standard RLHF objective can be optimized with only a simple classification loss.
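For reference, a minimal PyTorch sketch of this classification loss might look like the following. The function name, argument layout, and the default `beta` value are illustrative choices rather than the exact code used in DPO-LLaMA, and the sequence-level log-probabilities are assumed to already be summed over the response tokens:

```python
import torch
import torch.nn.functional as F


def dpo_loss(
    policy_chosen_logps: torch.Tensor,    # log pi_theta(y_w | x), shape (batch,)
    policy_rejected_logps: torch.Tensor,  # log pi_theta(y_l | x), shape (batch,)
    ref_chosen_logps: torch.Tensor,       # log pi_ref(y_w | x), shape (batch,)
    ref_rejected_logps: torch.Tensor,     # log pi_ref(y_l | x), shape (batch,)
    beta: float = 0.1,                    # illustrative default for the KL-strength parameter
) -> torch.Tensor:
    """Binary classification-style DPO objective as described in [2]."""
    # Implicit rewards: how much more likely the policy makes each response
    # compared to the frozen reference model.
    chosen_rewards = policy_chosen_logps - ref_chosen_logps
    rejected_rewards = policy_rejected_logps - ref_rejected_logps

    # -log sigmoid(beta * margin): pushes the policy to rank the chosen
    # (preferred) response above the rejected one.
    logits = beta * (chosen_rewards - rejected_rewards)
    return -F.logsigmoid(logits).mean()
```

Because the loss depends only on log-probabilities from the policy and a frozen reference model, no reward model training or PPO rollout loop is required.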
The authors of the DPO paper also published their code, DPO: Direct Preference Optimization. However, in our opinion, the implementation and readability of that code leave much room for improvement. We are excited to introduce our project DPO-LLaMA, a clean open-source implementation of DPO for fine-tuning LLaMA models to follow human preferences. The project is implemented in PyTorch and provides comprehensive support for dataset preparation and fine-tuning.