[zero-RL] Summarising what LUFFY offers

Here’s a “standard” progression of training methodologies:

  1. Pre-training - This is where the model gains broad knowledge, forming the foundation necessary for reasoning.
  2. CPT (Continued Pre-training) - Makes the model knowledgeable about specific domains.
  3. SFT (Supervised Fine-Tuning) - Makes the model skilled at specific tasks by leveraging knowledge it already has.
  4. RL (Reinforcement Learning) - Aligns model behavior using methods like GRPO or DPO (see the sketch after this list).

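To make the GRPO mention in step 4 concrete, here is a minimal sketch of its core idea, group-relative advantage estimation: each rollout's reward is normalised against the mean and standard deviation of its own sampled group. This is only an illustration of the advantage step, not a full training loop, and the function name is mine.

```python
import numpy as np

def grpo_advantages(rewards: list[float]) -> np.ndarray:
    """Group-relative advantages: normalise each rollout's reward
    against the mean/std of its sampled group (the core idea in GRPO)."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# Example: 4 rollouts for one prompt, rewarded 1 if the final answer is correct.
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))  # correct rollouts get positive advantage
```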
Reasoning traces play different roles at each stage:

  • They require knowledge from pre-training as a foundation
  • They serve as training targets for specific tasks during SFT, which the annotator describes as “rigid learning”
  • They can be aligned/enforced for specific model behavior during RL

LUFFY combines elements of both SFT (learning from expert reasoning traces) and RL (dynamically balancing imitation and exploration) in a novel way.

Traditional approaches tend to separate these stages, but LUFFY integrates off-policy reasoning guidance (typically associated with SFT) directly into the zero-RL paradigm, getting the benefits of both approaches simultaneously rather than sequentially.
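One way to picture this integration, assuming I have understood the idea correctly: off-policy expert traces are dropped into the same rollout group as the model's own on-policy attempts, and advantages are computed group-relatively over the combined set. The sketch below is my own illustration of that mechanism, not LUFFY's actual implementation (which also involves importance-weight corrections for the off-policy part); the function name and reward layout are hypothetical.

```python
import numpy as np

def mixed_group_advantages(on_policy_rewards, off_policy_rewards):
    """Sketch of the mixed-policy idea: pool off-policy (expert) traces with
    on-policy rollouts in one group, then compute group-relative advantages
    over the combined set. When the model's own rollouts fail, the high-reward
    expert traces get positive advantage and pull the policy towards imitation;
    when its own rollouts succeed, the expert traces add little extra signal."""
    rewards = np.asarray(list(on_policy_rewards) + list(off_policy_rewards), dtype=float)
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    n_on = len(on_policy_rewards)
    return adv[:n_on], adv[n_on:]  # advantages for on-policy vs. off-policy traces

# Early in training the policy mostly fails, so the expert trace dominates:
on_adv, off_adv = mixed_group_advantages([0.0, 0.0, 0.0], [1.0])
print(on_adv, off_adv)  # negative for the failed rollouts, positive for the expert trace
```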

It is not clear whether LUFFY beats SFT + RL; however, the data shows it is on par with SFT whilst the temperature is below 0.6.

This is something that really intrigues me, and it leaves me with a key question: is the temperature used in a CoT Self-Consistency/Majority Vote-style approach (where any outliers could be averaged out), or is it simply the inference temperature after training (in which case, does applying a CoT Self-Consistency approach also improve the response)?
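For concreteness, this is what the self-consistency variant of that question would look like: sample several CoT completions at the given temperature and take the majority final answer. The `generate` callable is a hypothetical stand-in for whatever inference stack is used.

```python
from collections import Counter

def self_consistency_answer(prompt: str, generate, n_samples: int = 8, temperature: float = 0.6) -> str:
    """Sample several CoT completions at the given temperature and return the
    majority-vote final answer, averaging out outlier reasoning chains.
    `generate(prompt, temperature=...)` is a hypothetical callable that returns
    a (chain_of_thought, final_answer) pair."""
    answers = [generate(prompt, temperature=temperature)[1] for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]
```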

The cost, especially in comparison to SFT and SFT + RL, is not clear from the video; the paper may cover it.

Referencing: Off Policy “zero RL” in simple terms