📃 Paper ⚙️ Code 🤗 Data 📭 Contact
Overview
Process Reward Models (PRMs) evaluate the reasoning process and provide guidance to policy models, with the potential to further enhance the reasoning capabilities of LLMs. However, the performance of current PRMs remains limited.
We propose Reasoning-Driven Process Reward Modeling (R-PRM), a novel approach that enhances LLMs' ability to evaluate mathematical reasoning step by step. Our framework consists of three stages: a supervised cold start, further meta-optimization in a self-evolving style, and finally inference-time scaling.
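To make the "reasoning-driven" idea concrete, here is a minimal sketch of the scoring loop: rather than emitting a bare score, the reward model first writes an analysis of each step and a verdict is then extracted from that critique. All names and the `Judgment:` critique format are illustrative assumptions, not the released R-PRM interface.

```python
# Hypothetical sketch of reasoning-driven process reward scoring.
# Assumption: each generated critique ends with "Judgment: correct|incorrect".

def judge_step(analysis: str) -> bool:
    """Extract the final correct/incorrect verdict from a generated analysis."""
    verdict = analysis.strip().lower().rsplit("judgment:", 1)[-1].strip()
    return verdict.startswith("correct")

def score_solution(step_analyses: list[str]) -> int:
    """Return the index of the first incorrect step, or -1 if every step
    passes (the label convention ProcessBench uses)."""
    for i, analysis in enumerate(step_analyses):
        if not judge_step(analysis):
            return i
    return -1

# Toy critiques standing in for actual model generations:
critiques = [
    "The substitution is valid. Judgment: correct",
    "Sign error when expanding the square. Judgment: incorrect",
]
print(score_solution(critiques))  # -> 1
```

The key design point is that the judgment is conditioned on an explicit analysis, so the model's evaluation ability can itself be improved by optimizing the quality of those critiques.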
- 🚀 Our method achieves impressive performance with great efficiency. At the same data scale, our approach outperforms Qwen2.5-Math-7B-PRM800K by 8.7 F1 points on ProcessBench. With the help of DPO, which requires no additional annotated data, our PRM achieves performance comparable to Qwen2.5-Math-PRM using only ~15% of its data volume.
- ⭐ We validate the effectiveness of R-PRM's self-improvement through preference optimization (meta-optimization) and inference-time scaling. Its performance surpasses that of Llama3.3-70B, which was used to synthesize the cold-start data, confirming that our method achieves more than simple distillation of the teacher model.
- 📈 Our model excels at fine-grained evaluation, achieving the best performance among open-source PRMs on PRMBench, which provides fine-grained error-category labels. Additionally, under our model's guidance, the policy model attains the highest average accuracy across six mathematical reasoning datasets with both Best-of-N and guided search.
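The Best-of-N guidance mentioned above can be sketched in a few lines: sample N candidate solutions from the policy, score each with the PRM, and keep the highest-scoring one. The aggregation rule below (taking the minimum over per-step scores) is one common choice, assumed here for illustration rather than taken from the paper.

```python
# Minimal Best-of-N sketch (illustrative names, not the R-PRM codebase).

def aggregate_min(step_scores: list[float]) -> float:
    """Aggregate per-step PRM scores into one solution-level score.
    Taking the minimum penalizes any single weak step."""
    return min(step_scores)

def best_of_n(candidates: dict[str, list[float]]) -> str:
    """Pick the candidate solution whose aggregated PRM score is highest."""
    return max(candidates, key=lambda c: aggregate_min(candidates[c]))

# Toy per-step scores for two sampled solutions:
candidates = {
    "sol_a": [0.9, 0.2, 0.8],  # one weak step drags the whole solution down
    "sol_b": [0.7, 0.6, 0.9],
}
print(best_of_n(candidates))  # -> sol_b
```

Guided search applies the same scoring at each step during decoding instead of only at the end, pruning low-scoring partial solutions early.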

🏆 Experiment Results
🧪 Data Efficiency
R-PRM demonstrates exceptional data efficiency under varying training scales:
- With just 12.8k training samples, R-PRM reaches 52.6 F1, already surpassing most open-source PRMs.
- R-PRM achieves +3.6 F1 over Qwen2.5-Math-7B-PRM800K when trained on just 64k samples (vs. Qwen's 265k), and extends this lead to +8.7 F1 when both are trained on comparable data volumes.
- Notably, despite using only ~15% of the data, R-PRM’s performance is already comparable to Qwen2.5-Math-PRM, which was trained on a much larger 1.8M LLM-filtered dataset.
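For readers unfamiliar with the F1 numbers quoted above: on ProcessBench, each problem is labeled with the index of the first erroneous step (or -1 if the solution is error-free), and the reported F1 is, to our understanding, the harmonic mean of the accuracies on the erroneous and the error-free subsets. A small sketch under that assumption:

```python
# ProcessBench-style F1 sketch. Assumption: F1 is the harmonic mean of
# (a) accuracy on erroneous problems (first wrong step located exactly) and
# (b) accuracy on error-free problems (model predicts -1).

def processbench_f1(preds: list[int], labels: list[int]) -> float:
    err_hits = err_total = ok_hits = ok_total = 0
    for p, y in zip(preds, labels):
        if y == -1:                 # error-free solution
            ok_total += 1
            ok_hits += (p == -1)
        else:                       # first error at step y
            err_total += 1
            err_hits += (p == y)
    acc_err = err_hits / err_total
    acc_ok = ok_hits / ok_total
    return 2 * acc_err * acc_ok / (acc_err + acc_ok)

# Toy labels/predictions: one missed error, everything else correct.
labels = [2, -1, 0, -1]
preds  = [2, -1, 1, -1]
print(round(processbench_f1(preds, labels), 3))  # -> 0.667
```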

📊 ProcessBench