[zero-RL] LUFFY: Learning to reason Under oFF policY guidance

Building on conventional zero-RL methods such as GRPO, LUFFY introduces off-policy reasoning traces (e.g., from DeepSeek-R1) and mixes them with the model's own on-policy rollouts before advantage computation.
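To make the mixing step concrete, here is a minimal sketch of pooling off-policy traces into a GRPO group before the group-relative advantage is computed. It assumes binary correctness rewards; the function names and the example numbers are mine, not the paper's.

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """GRPO-style group-normalized advantage: (r - mean) / std within one group."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def mixed_group_advantages(on_policy_rewards: np.ndarray,
                           off_policy_rewards: np.ndarray):
    """Sketch of LUFFY's mixing: pool off-policy traces (e.g., DeepSeek-R1
    rollouts) with the model's own rollouts into a single group *before*
    computing advantages, so strong external traces shift the baseline."""
    rewards = np.concatenate([on_policy_rewards, off_policy_rewards])
    adv = grpo_advantages(rewards)
    n_on = len(on_policy_rewards)
    return adv[:n_on], adv[n_on:]  # advantages for on- and off-policy traces

# Hypothetical group: 6 on-policy rollouts (mostly wrong) + 2 correct R1 traces.
on_r = np.array([0., 0., 1., 0., 0., 0.])
off_r = np.array([1., 1.])
adv_on, adv_off = mixed_group_advantages(on_r, off_r)
print(adv_on, adv_off)
```

Because the pooled group contains correct off-policy traces, incorrect on-policy rollouts receive clearly negative advantages even when the model alone would rarely succeed; that is the point of mixing before, rather than after, normalization.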

… However, naively combining off-policy traces can lead to overly rapid convergence and entropy collapse, causing the model to latch onto superficial patterns rather than acquiring genuine reasoning capabilities.
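One way to see this failure mode concretely is to track the policy's token-level entropy during training: under naive mixing it drops quickly as the model commits to a narrow set of imitated patterns. A minimal monitoring sketch (the names and shapes are illustrative, not from the paper):

```python
import torch

def mean_token_entropy(logits: torch.Tensor) -> torch.Tensor:
    """Mean per-token entropy of the policy, for logits of shape
    (batch, seq_len, vocab). A steadily shrinking value during training
    is the usual symptom of entropy collapse."""
    log_probs = torch.log_softmax(logits, dim=-1)
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1)  # (batch, seq_len)
    return entropy.mean()

# Illustrative call with random logits standing in for a policy forward pass.
logits = torch.randn(4, 16, 32000)
print(mean_token_entropy(logits).item())
```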

…genuine reasoning capabilities… I am not certain whether the implication is that DeepSeek-R1 can reason, or whether it is a reminder that no model can genuinely reason.

This is another can of worms that I do not yet have a clear position on, and I won't get into it here.

Source: Off Policy “zero-RL” explained in simple terms