Abstract
Preference Optimization (PO) is gaining popularity as an alternative to Proximal Policy Optimization (PPO) for aligning Large Language Models (LLMs). Recent research on aligning LLMs iteratively with synthetic or partially synthetic data shows promising results in scaling up PO training, both in academic settings and for proprietarily trained models such as Llama 3. Despite this success, our study shows that the length exploitation issue present in PO is even more severe in Iterative Preference Optimization (IPO) due to the iterative nature of the process. In this work, we study iterative preference optimization with synthetic data and share the findings and analysis gathered while building our iterative preference optimization pipeline. More specifically, we examine the length exploitation issue during iterative preference optimization and propose a training objective tailored to it, namely Agreement-aware Iterative Preference Optimization (AIPO). To demonstrate the effectiveness of our method, we conduct comprehensive experiments and achieve state-of-the-art performance on MT-Bench, AlpacaEval 2.0, and Arena-Hard.
Iterative Preference Optimization
We bridge the gap from non-iterative to iterative preference optimization by analyzing the components and properties of iterative preference optimization with synthetic data. Our study reveals the length exploitation issue in iterative preference optimization and its relation to the use of self-generated responses for training.
Comparison of different training recipes from non-iterative preference optimization to iterative preference optimization.
Our training pipeline for iterative preference optimization.
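As a rough illustration of what one iteration of such a pipeline looks like, the sketch below generates candidate responses with the current policy, scores them with a reward model, and forms preference pairs for a training update. The helper callables (`generate`, `reward`, `train_step`), the best-versus-worst pairing rule, and the number of candidates are illustrative assumptions, not the exact recipe of our pipeline.

```python
from typing import Callable, List, Tuple


def run_iteration(prompts: List[str],
                  generate: Callable[[str, int], List[str]],
                  reward: Callable[[str, str], float],
                  train_step: Callable[[List[Tuple[str, str, str]]], None],
                  num_candidates: int = 4) -> None:
    """One illustrative iteration of preference optimization with self-generated data.

    generate(prompt, n)  -> n candidate responses sampled from the current policy
    reward(prompt, resp) -> scalar score from the reward model
    train_step(pairs)    -> one preference-optimization update on
                            (prompt, chosen, rejected) triplets
    """
    pairs = []
    for prompt in prompts:
        candidates = generate(prompt, num_candidates)
        # Rank self-generated candidates by reward-model score.
        ranked = sorted(candidates, key=lambda r: reward(prompt, r))
        # Best-scoring response becomes "chosen", worst becomes "rejected"
        # (an assumed pairing rule for this sketch).
        pairs.append((prompt, ranked[-1], ranked[0]))
    train_step(pairs)
```

Repeating this loop, with the policy from the previous iteration used both to generate new responses and to initialize the next round of training, is what makes the process iterative.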
AIPO: Agreement-Aware Adjustment
By analyzing the properties of DPO and of training data built from self-generated responses, we propose improving DPO for iterative preference optimization by leveraging the agreement between the reference model and the reward model. We introduce the AIPO objective:
\[\mathcal{L}_{\text{AIPO}}=-\log\sigma \Bigl( \beta \log \frac{\pi_\theta(y_w\mid x)}{\pi_\theta(y_l\mid x)} - (1+\alpha) \beta \log \frac{\pi_\mathrm{ref}(y_w\mid x)}{\pi_\mathrm{ref}(y_l\mid x)} \Bigr) - \frac{\lambda}{\lvert y_w\rvert} \log \bigl( \pi_\theta (y_w \mid x) \bigr)\]
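For concreteness, here is a minimal PyTorch sketch of the loss above, assuming sequence-level log probabilities have already been computed for the policy and reference models; the function name, argument names, and default hyperparameter values are ours and are not taken from the released code.

```python
import torch.nn.functional as F


def aipo_loss(policy_chosen_logps, policy_rejected_logps,
              ref_chosen_logps, ref_rejected_logps,
              chosen_lengths, beta=0.1, alpha=0.5, lam=0.1):
    """Per-example AIPO loss from sequence-level log probabilities.

    All *_logps are summed token log-probs of shape (batch,);
    chosen_lengths is |y_w|, the token count of the chosen response.
    beta, alpha, lam defaults are illustrative, not the paper's settings.
    """
    policy_logratio = policy_chosen_logps - policy_rejected_logps
    ref_logratio = ref_chosen_logps - ref_rejected_logps

    # Agreement-aware term: the reference log-ratio is scaled by (1 + alpha).
    logits = beta * policy_logratio - (1.0 + alpha) * beta * ref_logratio

    # Length-normalized NLL regularizer on the chosen response.
    nll = -policy_chosen_logps / chosen_lengths

    return -F.logsigmoid(logits) + lam * nll
```

Averaging this quantity over a batch of preference pairs gives the training loss: the (1 + α) factor strengthens the reference-model term relative to standard DPO, and the final term is a length-normalized NLL regularizer on the chosen response.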
Comparison of different methods under the same length constraint.
Open Source
🌟 We are open source! We provide the code for training and evaluating models, along with related tools. Our efficient, flexible, and scalable training framework supports iterative training of large-scale models (up to 123B parameters). Explore our repositories: