[zero-RL] Summarising what LUFFY offers

Here’s a “standard” progression of training methodologies:

  1. Pre-training - This is where the model gains broad knowledge, forming the foundation necessary for reasoning.
  2. CPT (Continued Pre-training) - Makes the model knowledgeable about specific domains.
  3. SFT (Supervised Fine-Tuning) - Makes the model skilled at specific tasks by leveraging knowledge it already has.
  4. RL (Reinforcement Learning) - Aligns model behavior using methods like GRPO or DPO (see the sketch after this list).

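To make the GRPO mention in step 4 concrete, here is a minimal sketch of its core idea, group-relative advantage estimation: each rollout's reward is normalised against the mean and standard deviation of its own sampled group. This is only an illustration of the advantage step, not a full training loop, and the function name is mine.

```python
import numpy as np

def grpo_advantages(rewards: list[float]) -> np.ndarray:
    """Group-relative advantages: normalise each rollout's reward
    against the mean/std of its sampled group (the core idea in GRPO)."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# Example: 4 rollouts for one prompt, rewarded 1 if the final answer is correct.
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))  # correct rollouts get positive advantage
```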
Reasoning traces play different roles at each stage:

  • They require knowledge from pre-training as a foundation
  • They serve as training targets for specific tasks during SFT, which the annotator describes as “rigid learning”
  • They can be aligned/enforced for specific model behavior during RL

LUFFY combines elements of both SFT (learning from expert reasoning traces) and RL (dynamically balancing imitation and exploration) in a novel way.

Traditional approaches tend to separate these stages, but LUFFY integrates off-policy reasoning guidance (typically associated with SFT) directly into the zero-RL paradigm, getting the benefits of both approaches simultaneously rather than sequentially.
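One way to picture this integration, assuming I have understood the idea correctly: off-policy expert traces are dropped into the same rollout group as the model's own on-policy attempts, and advantages are computed group-relatively over the combined set. The sketch below is my own illustration of that mechanism, not LUFFY's actual implementation (which also involves importance-weight corrections for the off-policy part); the function name and reward layout are hypothetical.

```python
import numpy as np

def mixed_group_advantages(on_policy_rewards, off_policy_rewards):
    """Sketch of the mixed-policy idea: pool off-policy (expert) traces with
    on-policy rollouts in one group, then compute group-relative advantages
    over the combined set. When the model's own rollouts fail, the high-reward
    expert traces get positive advantage and pull the policy towards imitation;
    when its own rollouts succeed, the expert traces add little extra signal."""
    rewards = np.asarray(list(on_policy_rewards) + list(off_policy_rewards), dtype=float)
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    n_on = len(on_policy_rewards)
    return adv[:n_on], adv[n_on:]  # advantages for on-policy vs. off-policy traces

# Early in training the policy mostly fails, so the expert trace dominates:
on_adv, off_adv = mixed_group_advantages([0.0, 0.0, 0.0], [1.0])
print(on_adv, off_adv)  # negative for the failed rollouts, positive for the expert trace
```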

It is not clear whether LUFFY beats SFT + RL; however, the data shows it is on par with SFT whilst the temperature is below 0.6.

This is something that really intrigues me, and it leaves me with a key question: is the temperature used in a CoT Self-Consistency/Majority Vote-style approach (where any outliers could be averaged out), or is it simply the inference temperature after training (in which case, does applying a CoT Self-Consistency approach also improve the response)?
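For concreteness, this is what the self-consistency variant of that question would look like: sample several CoT completions at the given temperature and take the majority final answer. The `generate` callable is a hypothetical stand-in for whatever inference stack is used.

```python
from collections import Counter

def self_consistency_answer(prompt: str, generate, n_samples: int = 8, temperature: float = 0.6) -> str:
    """Sample several CoT completions at the given temperature and return the
    majority-vote final answer, averaging out outlier reasoning chains.
    `generate(prompt, temperature=...)` is a hypothetical callable that returns
    a (chain_of_thought, final_answer) pair."""
    answers = [generate(prompt, temperature=temperature)[1] for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]
```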

The cost, especially in comparison to SFT and SFT + RL, is not clear from the video; the paper may cover it.

Referencing: Off Policy “zero RL” in simple terms