<aside> 🌱
Authors: Shuaijie She, Yu Bao, Yu Lu, Lu Xu, Tao Li, Wenhao Zhu, Shujian Huang, Shanbo Cheng, Lu Lu, Yuxuan Wang Affiliations: ByteDance Seed, Nanjing University
</aside>
<aside> đź’Ą
We are thrilled to introduce our latest breakthrough: DuPO, a novel framework designed to enhance the capabilities of Large Language Models (LLMs) without relying on human annotations or oracle supervision.
The advancement of Large Language Models (LLMs) has relied heavily on preference optimization techniques that share a critical limitation: dependence on external supervision. RLHF suffers from costly and inconsistent human annotations, while RLAIF merely shifts the noise problem from human annotators to potentially biased judge models. RLVR eliminates noisy feedback for objective tasks, but remains powerless for free-form generation such as translation or creative writing, where multiple valid outputs exist and no single ground truth captures quality. This fundamental bottleneck constrains both scalability and applicability across diverse tasks.
Dual learning, a self-supervised paradigm, has long held promise. It leverages the intrinsic symmetry between a primal task (e.g., translating English to German) and its dual task (e.g., translating German back to English). If the output of the dual task reconstructs the original input, the primal output is considered high-quality. Since LLMs possess diverse capabilities from extensive pre-training, they could in principle be trained this way across various tasks. However, applying this framework to LLMs is non-trivial and faces two critical challenges:
For instance, a primal output alone (e.g., y = 8) contains insufficient information to uniquely reconstruct the original problem (x could be "What is 3+5?", "What is 10-2?", or "What is the atomic number of oxygen?"), breaking the dual-learning cycle. These mismatches render traditional dual learning ill-suited for general LLM optimization, leaving it an open challenge.

To address this, we introduce DuPO, a framework built on a novel generalized duality. Instead of requiring full input reconstruction, DuPO reframes the problem by decomposing the input ($x$) into known ($x_k$) and unknown ($x_u$) components. The dual task is then redesigned: rather than trying to recover the entire input $x$ from the primal output $y$, its objective is to reconstruct only the unknown component $x_u$, using both the primal output $y$ and the known component $x_k$.
Let’s consider a simple mathematical reasoning example (Fig. c):
Given the input x = "A=5 and B=3. Calculate A+B=?", the model produces the solution y = 8. DuPO treats $x_k$ = (A=5) as the known component and $x_u$ = (B=3) as the unknown component. The dual task then reconstructs $x_u$ from $y$ and $x_k$ (here, B = 8 - 5 = 3); a successful reconstruction serves as a self-supervised signal that the primal output is correct.
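The generalized-duality idea above can be sketched in a few lines of code. This is a minimal illustration of the reward logic for the arithmetic example only, not the authors' implementation; the function names (`primal_task`, `dual_reconstruct`, `duality_reward`) are hypothetical.

```python
# Illustrative sketch of DuPO's generalized duality on the A+B example.
# The input decomposes into a known part x_k (A=5) and an unknown part
# x_u (B=3); the dual task recovers x_u from the primal output y and x_k.

def primal_task(x_k: int, x_u: int) -> int:
    """Primal task: answer 'Calculate A+B' from the full input."""
    return x_k + x_u  # the model's solution y

def dual_reconstruct(y: int, x_k: int) -> int:
    """Dual task: recover the unknown component x_u from y and x_k."""
    return y - x_k

def duality_reward(y: int, x_k: int, x_u: int) -> float:
    """Self-supervised reward: 1.0 if the dual task recovers x_u."""
    return 1.0 if dual_reconstruct(y, x_k) == x_u else 0.0

A, B = 5, 3            # x_k = (A=5), x_u = (B=3)
y = primal_task(A, B)  # y = 8
print(duality_reward(y, A, B))   # correct primal output -> reward 1.0
print(duality_reward(7, A, B))   # wrong primal output   -> reward 0.0
```

Note how the known component $x_k$ resolves the ambiguity from the earlier "y = 8" counterexample: with A fixed, only one value of B is consistent with the output, so the reconstruction check becomes a usable training signal.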