[zero-RL] Summarising what LUFFY offers
Here’s a “standard” progression of training methodologies:
- Pre-training - This is where the model gains broad knowledge, forming the foundation necessary for reasoning.
- CPT (Continued Pre-training) - Makes the model knowledgeable about specific domains.
- SFT (Supervised Fine-Tuning) - Makes the model skilled at specific tasks by leveraging knowledge it already has.
- RL (Reinforcement Learning) - Uses methods like GRPO and DPO to align model behavior (see the sketch after this list).
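To make the GRPO mention concrete, here is a minimal sketch of its group-relative advantage idea, assuming a scalar reward per sampled rollout; the function name and reward scheme are my own illustration, not taken from the video or paper. Each prompt gets a group of rollouts, and each rollout's reward is normalised against its group's mean and standard deviation instead of a learned critic.

```python
import numpy as np

def grpo_group_advantages(rewards, eps=1e-6):
    """Group-relative advantage: normalise each rollout's reward against the
    mean and std of its own group, so no separate critic model is needed."""
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: 4 sampled rollouts for one prompt, rewarded 1 if the final answer is correct.
print(grpo_group_advantages([1.0, 0.0, 0.0, 1.0]))  # correct rollouts get positive advantage
```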
Reasoning traces play different roles at each stage:
- They require knowledge from pre-training as a foundation
- They supply task-specific reasoning during SFT, which the annotator describes as “rigid learning”
- They can be reinforced/steered towards specific model behavior during RL
LUFFY combines elements of both SFT (learning from expert reasoning traces) and RL (dynamically balancing imitation and exploration) in a novel way.
Traditional approaches tend to separate these stages, but LUFFY integrates off-policy reasoning guidance (typically associated with SFT) directly into the zero-RL paradigm, getting the benefits of both approaches simultaneously rather than sequentially.
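As a rough sketch of that integration (my own paraphrase with hypothetical names, not LUFFY's actual implementation): the expert's off-policy traces are scored inside the same group-relative baseline as the model's own rollouts, so a single advantage signal decides when to imitate and when to rely on exploration. The real method also has to account for the fact that expert tokens were not sampled from the current policy (off-policy weighting), which this sketch deliberately ignores.

```python
import numpy as np

def mixed_group_advantages(on_policy_rewards, off_policy_rewards, eps=1e-6):
    """Score the model's own rollouts and the expert's off-policy traces in one
    group, so imitation and exploration share the same advantage signal."""
    rewards = np.asarray(list(on_policy_rewards) + list(off_policy_rewards), dtype=np.float64)
    adv = (rewards - rewards.mean()) / (rewards.std() + eps)
    n_on = len(on_policy_rewards)
    return adv[:n_on], adv[n_on:]

# Hard prompt: every on-policy rollout fails (reward 0) while the expert trace succeeds,
# so only the expert trace gets a positive advantage and the update leans towards imitation.
on_adv, off_adv = mixed_group_advantages([0.0, 0.0, 0.0], [1.0])
print(on_adv, off_adv)  # roughly [-0.577, -0.577, -0.577] and [1.732]
```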
It is not clear whether LUFFY beats SFT + RL; however, the data shows that it is on par with SFT whilst the temperature is below 0.6.
This really intrigues me and leaves me with a key question: is the temperature used in a CoT Self-Consistency/Majority-Vote-style approach (where any outlier chains could be averaged out), or does it simply refer to inference sampling after training (in which case, does applying a CoT Self-Consistency approach also improve the response)?
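For reference, this is what the Self-Consistency/Majority-Vote half of that question means in practice; a minimal sketch with made-up answers, where in real use each answer would come from a separate reasoning chain sampled at the temperature in question.

```python
from collections import Counter

def self_consistency_answer(final_answers):
    """CoT Self-Consistency: sample several reasoning chains at a non-zero
    temperature, keep only each chain's final answer, and take the majority
    vote so occasional outlier chains are averaged out."""
    answer, votes = Counter(final_answers).most_common(1)[0]
    return answer, votes / len(final_answers)

# Example: 5 chains sampled at temperature 0.6; the single outlier chain is outvoted.
print(self_consistency_answer(["42", "42", "17", "42", "42"]))  # ('42', 0.8)
```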
The cost, especially in comparison to SFT and SFT + RL, is not clear from the video; maybe the paper covers it.
Referencing: Off Policy “zero RL” in simple terms