AIPO: Improving Training Objective for Iterative Preference Optimization

Yaojie Shen1,2, Xinyao Wang3, Yulei Niu3, Ying Zhou1,2, Lexin Tang3, Libo Zhang1,2, Fan Chen3, Longyin Wen3
1 Institute of Software, Chinese Academy of Sciences 2 University of Chinese Academy of Sciences
3 ByteDance Inc.
| Model | Arena-Hard WR (%) ↑ | Arena-Hard Avg. Tokens | AlpacaEval 2.0 LC (%) | AlpacaEval 2.0 Avg. Length |
|---|---|---|---|---|
| AIPO (Mistral-Large 2407) | 81.6 | 656 | 67.8 | 2277 |
| Claude 3.5 Sonnet (06/20) | 79.3 | 567 | 52.4 | 1488 |
| GPT-4 Omni (05/13) | 79.2 | 696 | 57.5 | 1873 |
| GPT-4o Mini | 74.9 | 668 | 50.7 | 1861 |
| AIPO (Llama-3-70B-Instruct) | 63.5 | 616 | 60.5 | 2081 |
| AIPO (Gemma-2-27B-It) | 63.5 | 643 | 57.8 | 1768 |
| Claude 3 Opus (02/29) | 60.4 | 541 | 40.5 | 1388 |
| AIPO (Mistral-Nemo-Instruct-2407) | 56.2 | 597 | 60.4 | 2122 |
| Claude 3 Sonnet (02/29) | 46.8 | 552 | 34.9 | 1420 |
| GPT-4 (06/13) | 37.9 | 354 | 30.2 | 1140 |
| GPT-3.5 Turbo (06/13) | 24.8 | 401 | 22.7 | 1328 |
| Claude 2 | 24.0 | 295 | 28.2 | 1069 |

Performance of AIPO compared to proprietary models on Arena-Hard and AlpacaEval 2.0.

Abstract

Preference Optimization (PO) is gaining popularity as an alternative to Proximal Policy Optimization (PPO) for aligning Large Language Models (LLMs). Recent research on aligning LLMs iteratively with synthetic or partially synthetic data shows promising results in scaling up PO training, both in academic settings and for proprietarily trained models such as Llama 3. Despite this success, our study shows that the length exploitation issue present in PO is even more severe in Iterative Preference Optimization (IPO) due to the iterative nature of the process. In this work, we study iterative preference optimization with synthetic data and share the findings and analysis gathered while building our iterative preference optimization pipeline. More specifically, we discuss the length exploitation issue during iterative preference optimization and propose a training objective for it, namely Agreement-aware Iterative Preference Optimization (AIPO). To demonstrate the effectiveness of our method, we conduct comprehensive experiments and achieve state-of-the-art performance on MT-Bench, AlpacaEval 2.0, and Arena-Hard.

Iterative Preference Optimization

We bridge the gap between non-iterative and iterative preference optimization by analyzing the components and properties of iterative training with synthetic data. Our study reveals the length exploitation issue in iterative preference optimization and its relation to training on self-generated responses; a sketch of such an iterative loop follows the figures below.

Comparison of different training recipes from non-iterative preference optimization to iterative preference optimization.

Our training pipeline for iterative preference optimization.
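For illustration, here is a minimal sketch of one possible iterative loop with self-generated responses. The helper callables (`generate`, `score`, `train_step`, `snapshot`) are hypothetical stand-ins for the components of such a pipeline, not the exact implementation used in our training framework.

```python
from typing import Callable, List, Tuple

def iterative_po(
    policy,
    prompts: List[str],
    generate: Callable,    # (policy, prompt) -> list of candidate responses
    score: Callable,       # (prompt, response) -> reward-model score
    train_step: Callable,  # (policy, reference, preference_pairs) -> updated policy
    snapshot: Callable,    # policy -> frozen copy used as the reference model
    num_iterations: int = 3,
):
    """Sketch of iterative preference optimization with self-generated data."""
    for _ in range(num_iterations):
        reference = snapshot(policy)  # the current policy becomes pi_ref for this round
        pairs: List[Tuple[str, str, str]] = []
        for x in prompts:
            candidates = generate(policy, x)            # self-generated candidates
            ranked = sorted(candidates, key=lambda y: score(x, y))
            pairs.append((x, ranked[-1], ranked[0]))    # (prompt, chosen y_w, rejected y_l)
        policy = train_step(policy, reference, pairs)   # one round of preference training
    return policy
```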


AIPO: Agreement-Aware Adjustment

By analyzing the properties of DPO when trained on self-generated responses, we propose improving DPO for iterative preference optimization by leveraging the agreement between the reference model and the reward model. We introduce the AIPO objective:

\[\mathcal{L}_{\text{AIPO}}=-\log\sigma \Bigl( \beta \log \frac{\pi_\theta(y_w\mid x)}{\pi_\theta(y_l\mid x)} - (1+\alpha) \beta \log \frac{\pi_\mathrm{ref}(y_w\mid x)}{\pi_\mathrm{ref}(y_l\mid x)} \Bigr) - \frac{\lambda}{\lvert y_w\rvert} \log \bigl( \pi_\theta (y_w \mid x) \bigr)\]
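The following PyTorch snippet is a minimal sketch of this objective, assuming the summed per-sequence log-probabilities have already been computed for the chosen (y_w) and rejected (y_l) responses under both the policy and the reference model. The function name and default hyperparameter values are illustrative, not the recommended settings from the paper.

```python
import torch
import torch.nn.functional as F

def aipo_loss(policy_chosen_logps, policy_rejected_logps,
              ref_chosen_logps, ref_rejected_logps,
              chosen_lengths, beta=0.1, alpha=0.1, lam=0.1):
    """Sketch of the AIPO loss for a batch of preference pairs.

    *_logps: summed token log-probabilities of y_w / y_l under pi_theta / pi_ref.
    chosen_lengths: token length |y_w| of each chosen response (float tensor).
    beta, alpha, lam: placeholder hyperparameter values.
    """
    policy_logratio = policy_chosen_logps - policy_rejected_logps
    ref_logratio = ref_chosen_logps - ref_rejected_logps

    # -log sigma( beta * policy log-ratio - (1 + alpha) * beta * reference log-ratio )
    preference_term = -F.logsigmoid(beta * policy_logratio - (1.0 + alpha) * beta * ref_logratio)

    # length-normalized NLL on the chosen response: -(lambda / |y_w|) * log pi_theta(y_w | x)
    nll_term = -(lam / chosen_lengths) * policy_chosen_logps

    return (preference_term + nll_term).mean()
```

Relative to DPO, the (1 + α) factor scales the reference log-ratio according to the agreement term, and the length-normalized NLL term regularizes toward the chosen response.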

Comparison of different methods under the same length constraint.

Open Source

🌟 We are open source! We provide the code for training and evaluating models, along with other related tools. Our efficient, flexible, and scalable training framework supports iterative training of large-scale models (up to 123B parameters). Explore our repositories: