Type: Local
From: POLARIS-Project
Quantisation: No
Precision: No
Size: 4B
Polaris is an open-source post-training method that uses reinforcement learning to enhance models with advanced reasoning abilities. The approach demonstrates that even smaller models like Qwen3-4B can achieve significant improvements on challenging reasoning tasks through RL optimization, with results surpassing top commercial systems like Claude-4-Opus, Grok-3-Beta, and o3-mini-high on benchmark evaluations. The method incorporates several key techniques including data difficulty analysis and distribution mapping, diversity-based rollout sampling with progressive temperature increases, inference-time length extrapolation for generating longer chain-of-thought reasoning while training with shorter sequences, and multi-stage training to improve exploration efficiency. Polaris leverages open-source data and academic-level resources to push the capabilities of open-recipe reasoning models to new heights.
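
One of the techniques named above, diversity-based rollout sampling with progressive temperature increases across training stages, lends itself to a short sketch. The Python below is illustrative only, not the POLARIS-Project implementation: the function names, the linear temperature schedule, and the `generate` callback are assumptions made for the example.

```python
def sample_rollouts(generate, prompts, n_rollouts=8,
                    temp_start=0.7, temp_end=1.4, n_stages=3):
    """Illustrative multi-stage rollout sampling: each successive training
    stage raises the sampling temperature so that rollouts stay diverse as
    the policy sharpens. `generate(prompt, temperature)` is a stand-in for
    any LLM sampling call; the linear schedule is a hypothetical choice."""
    for stage in range(n_stages):
        # Linearly interpolate temperature from temp_start to temp_end.
        t = temp_start + (temp_end - temp_start) * stage / max(n_stages - 1, 1)
        for prompt in prompts:
            rollouts = [generate(prompt, temperature=t) for _ in range(n_rollouts)]
            yield stage, prompt, rollouts


if __name__ == "__main__":
    import random

    # Stub sampler so the sketch runs without a real model.
    def fake_generate(prompt, temperature):
        return f"{prompt} -> answer (T={temperature:.2f}, noise={random.random():.3f})"

    for stage, prompt, outs in sample_rollouts(fake_generate, ["2+2=?"], n_rollouts=2):
        print(stage, prompt, outs)
```

In an actual RL post-training loop, the rollouts for each prompt would be scored and used to update the policy; the point of the schedule is simply that later stages sample at higher temperature to keep exploration from collapsing.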
