<aside> 🌱

Authors: Shuaijie She, Yu Bao, Yu Lu, Lu Xu, Tao Li, Wenhao Zhu, Shujian Huang, Shanbo Cheng, Lu Lu, Yuxuan Wang

Affiliations: ByteDance Seed, Nanjing University

</aside>


<aside> đź’Ą

We are thrilled to introduce our latest breakthrough: DuPO, a novel framework designed to enhance the capabilities of Large Language Models (LLMs) without relying on human annotations or oracle supervision.


The Challenge: Scaling LLM Optimization Beyond Annotation

The advancement of Large Language Models (LLMs) has relied heavily on preference optimization techniques that share a critical limitation: dependence on external supervision. RLHF suffers from costly and inconsistent human annotations, while RLAIF merely shifts the noise problem from human annotators to potentially biased judge models. RLVR eliminates noisy feedback for objective tasks, but remains powerless for free-form generation such as translation or creative writing, where multiple valid outputs exist and no single ground truth captures quality. This bottleneck constrains both the scalability of these methods and their applicability across diverse tasks.

Dual learning, a self-supervised paradigm, has long held promise. It leverages the intrinsic symmetry between a primal task (e.g., translating English to German) and its dual task (e.g., translating German back to English): if the output of the dual task reconstructs the original input, the primal output is considered high-quality. Because LLMs acquire diverse capabilities from extensive pre-training, they could in principle be trained this way across many tasks. However, applying this framework to LLMs is non-trivial and faces two critical challenges:

  1. Challenge I: Limited Duality in Non-Mutually Implicative Tasks. Most real-world LLM tasks (e.g., creative writing, math reasoning) lack strict invertibility. For example, a mathematical solution (y = 8) contains insufficient information to uniquely reconstruct the problem (x = "What is 3+5?", "What is 10-2?", or "What is the atomic number of oxygen?"), breaking the dual-learning cycle.
  2. Challenge II: Bidirectional Competence Asymmetry. LLMs may excel at a primal task yet struggle with its dual (e.g., strong at solving math problems but weak at generating problems from solutions), producing noisy feedback that penalizes correct outputs and destabilizes self-supervised learning.

These mismatches render traditional dual learning ill-suited for general LLM optimization, leaving it an open challenge.
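
To make the round-trip check described above concrete, here is a minimal Python sketch of the classic dual-learning loop for translation. It is illustrative only: the `generate(prompt) -> str` callable stands in for an LLM, and character-level similarity stands in for a proper reconstruction reward; neither is part of DuPO itself.

```python
from difflib import SequenceMatcher
from typing import Callable

def round_trip_reward(generate: Callable[[str], str], x: str) -> float:
    """Classic dual-learning signal: run the primal task, then its dual,
    and score how well the dual output reconstructs the original input."""
    y = generate(f"Translate English to German: {x}")      # primal task
    x_rec = generate(f"Translate German to English: {y}")  # dual task
    # Reconstruction similarity serves as a self-supervised reward for y;
    # no human label or reference translation is required.
    return SequenceMatcher(None, x, x_rec).ratio()

# Call shape only: an echo "model" that returns the prompt's payload,
# so the round trip reconstructs the input exactly (reward = 1.0).
if __name__ == "__main__":
    echo = lambda prompt: prompt.split(": ", 1)[1]
    print(round_trip_reward(echo, "The cat sits on the mat."))
```

The two challenges above describe exactly where this loop breaks for general LLM tasks: when y alone does not carry enough information to recover x, or when the dual direction is much weaker than the primal one, the reconstruction score stops being a reliable reward.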

[Figure: image.png]

DuPO: Dual Learning-based Preference Optimization

To address this, we introduce DuPO, a framework built on a novel generalized duality. Instead of requiring full input reconstruction, DuPO reframes the problem by decomposing the input ($x$) into known ($x_k$) and unknown ($x_u$) components. The dual task is then redesigned: rather than trying to recover the entire input $x$ from the primal output $y$, its objective is to reconstruct only the unknown component $x_u$, using both the primal output $y$ and the known component $x_k$.
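
As a rough illustration of this generalized duality, consider masking one operand of an arithmetic problem as $x_u$ and keeping the rest of the question as $x_k$. This is a minimal sketch under stated assumptions: the `generate` callable stands in for the policy model, and exact-match scoring stands in for DuPO's actual reconstruction reward.

```python
from typing import Callable

def dupo_style_reward(generate: Callable[[str], str],
                      x_known: str, x_unknown: str) -> float:
    """Generalized-duality reward: the dual task recovers only the unknown
    component x_u, given the primal output y and the known component x_k,
    rather than reconstructing the full input x."""
    # Primal task: answer the full problem assembled from (x_k, x_u).
    y = generate(f"Give only the final result. What is {x_known} + {x_unknown}?")
    # Dual task: reconstruct the masked component x_u from y and x_k.
    x_u_rec = generate(
        f"In the problem 'What is {x_known} + ___?', the result is {y}. "
        "What number fills the blank? Give only the number."
    )
    # How well x_u is recovered becomes the self-supervised score for y.
    return 1.0 if x_u_rec.strip() == x_unknown.strip() else 0.0
```

Because $y$ together with $x_k$ pins down $x_u$ (unlike $y$ alone, which could correspond to many different problems), the dual task becomes well-posed again: a correct primal answer makes $x_u$ recoverable, while a wrong one does not, which is the intuition behind the generalized duality.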

Let’s consider a simple mathematical reasoning example (Fig. c):