<aside> 🌱

Authors: Shuaijie She, Yu Bao, Yu Lu, Lu Xu, Tao Li, Wenhao Zhu, Shujian Huang, Shanbo Cheng, Lu Lu, Yuxuan Wang

Affiliations: ByteDance Seed, Nanjing University

</aside>


<aside> đź’Ą

We are thrilled to introduce our latest breakthrough: DuPO, a novel framework designed to enhance the capabilities of Large Language Models (LLMs) without relying on human annotations or oracle supervision.


The Challenge: Scaling LLM Optimization Beyond Annotation

The advancement of Large Language Models (LLMs) has relied heavily on preference optimization techniques that share a critical limitation: dependence on external supervision. RLHF suffers from costly and inconsistent human annotations, while RLAIF merely shifts the noise problem from human annotators to potentially biased judge models. RLVR eliminates noisy feedback for objective tasks, but remains powerless for free-form generation such as translation or creative writing, where multiple valid outputs exist and no single ground truth captures quality. This bottleneck constrains both the scalability of these methods and their applicability across diverse tasks.

Dual learning, a self-supervised paradigm, has long held promise. It leverages the intrinsic symmetry between a primal task (e.g., translating English to German) and its dual task (e.g., translating German back to English): if the output of the dual task reconstructs the original input, the primal output is considered high-quality. Because LLMs acquire diverse capabilities from extensive pre-training, they could in principle be trained this way across many tasks. However, applying this framework to LLMs is non-trivial and faces two critical challenges:

  1. Challenge I: Limited Duality in Non-Mutually Implicative Tasks. Most real-world LLM tasks (e.g., creative writing, math reasoning) lack strict invertibility. For example, a mathematical solution (y = 8) contains insufficient information to uniquely reconstruct the problem (x = "What is 3+5?", "What is 10-2?", or "What is the atomic number of oxygen?"), breaking the dual-learning cycle.
  2. Challenge II: Bidirectional Competence Asymmetry. LLMs may excel at a primal task yet struggle with its dual (e.g., strong at solving math problems but weak at generating problems from solutions), producing noisy feedback that penalizes correct outputs and destabilizes self-supervised learning.

These mismatches render traditional dual learning ill-suited for general LLM optimization, leaving it an open challenge.
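
To make the round-trip check described above concrete, here is a minimal Python sketch of the classic dual-learning loop for translation. It is illustrative only: the `generate(prompt) -> str` callable stands in for an LLM, and character-level similarity stands in for a proper reconstruction reward; neither is part of DuPO itself.

```python
from difflib import SequenceMatcher
from typing import Callable

def round_trip_reward(generate: Callable[[str], str], x: str) -> float:
    """Classic dual-learning signal: run the primal task, then its dual,
    and score how well the dual output reconstructs the original input."""
    y = generate(f"Translate English to German: {x}")      # primal task
    x_rec = generate(f"Translate German to English: {y}")  # dual task
    # Reconstruction similarity serves as a self-supervised reward for y;
    # no human label or reference translation is required.
    return SequenceMatcher(None, x, x_rec).ratio()

# Call shape only: an echo "model" that returns the prompt's payload,
# so the round trip reconstructs the input exactly (reward = 1.0).
if __name__ == "__main__":
    echo = lambda prompt: prompt.split(": ", 1)[1]
    print(round_trip_reward(echo, "The cat sits on the mat."))
```

The two challenges above describe exactly where this loop breaks for general LLM tasks: when y alone does not carry enough information to recover x, or when the dual direction is much weaker than the primal one, the reconstruction score stops being a reliable reward.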

[Figure: image.png]

DuPO: Dual Learning-based Preference Optimization

To address this, we introduce DuPO, a framework built on a novel generalized duality. Instead of requiring full input reconstruction, DuPO reframes the problem by decomposing the input ($x$) into known ($x_k$) and unknown ($x_u$) components. The dual task is then redesigned: rather than trying to recover the entire input $x$ from the primal output $y$, its objective is to reconstruct only the unknown component $x_u$, using both the primal output $y$ and the known component $x_k$.
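
As a rough illustration of this generalized duality, consider masking one operand of an arithmetic problem as $x_u$ and keeping the rest of the question as $x_k$. This is a minimal sketch under stated assumptions: the `generate` callable stands in for the policy model, and exact-match scoring stands in for DuPO's actual reconstruction reward.

```python
from typing import Callable

def dupo_style_reward(generate: Callable[[str], str],
                      x_known: str, x_unknown: str) -> float:
    """Generalized-duality reward: the dual task recovers only the unknown
    component x_u, given the primal output y and the known component x_k,
    rather than reconstructing the full input x."""
    # Primal task: answer the full problem assembled from (x_k, x_u).
    y = generate(f"Give only the final result. What is {x_known} + {x_unknown}?")
    # Dual task: reconstruct the masked component x_u from y and x_k.
    x_u_rec = generate(
        f"In the problem 'What is {x_known} + ___?', the result is {y}. "
        "What number fills the blank? Give only the number."
    )
    # How well x_u is recovered becomes the self-supervised score for y.
    return 1.0 if x_u_rec.strip() == x_unknown.strip() else 0.0
```

Because $y$ together with $x_k$ pins down $x_u$ (unlike $y$ alone, which could correspond to many different problems), the dual task becomes well-posed again: a correct primal answer makes $x_u$ recoverable, while a wrong one does not, which is the intuition behind the generalized duality.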

Let’s consider a simple mathematical reasoning example (Fig. c):