[zero-RL] what is it?
Zero-RL applies reinforcement learning (RL) directly to a base LM, eliciting its latent reasoning ability using the model's own rollouts. A fundamental limitation worth highlighting: it is inherently "on-policy", constraining learning exclusively to the model's self-generated outputs through iterative cycles of trials and feedback. Despite showing promising results, zero-RL is bounded by the capabilities of the base model itself.
A key characteristic is that an LLM can be trained this way without any Supervised Fine-Tuning (SFT) stage.
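To make the "on-policy, no SFT" loop concrete, here is a minimal sketch of the training cycle, under loud assumptions: a tiny toy policy (`TinyPolicy`) stands in for the base LM, `reward_fn` is a hypothetical verifiable reward, and a plain REINFORCE update stands in for whatever policy-gradient method a real zero-RL setup uses. The point is only the shape of the loop: sample rollouts from the current model, score them, and reinforce them, with no demonstration data anywhere.

```python
# Illustrative sketch of the zero-RL loop (assumptions noted above).
import torch
import torch.nn as nn

VOCAB, SEQ_LEN = 8, 6

class TinyPolicy(nn.Module):
    """Toy stand-in for a base LM: maps a running context to next-token logits."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, 32)
        self.rnn = nn.GRU(32, 32, batch_first=True)
        self.head = nn.Linear(32, VOCAB)

    def forward(self, tokens):
        h, _ = self.rnn(self.embed(tokens))
        return self.head(h[:, -1])  # logits for the next token

def reward_fn(rollout):
    # Hypothetical verifiable reward: +1 if the rollout matches a fixed target.
    target = torch.tensor([1, 2, 3, 4, 5, 6])
    return float((rollout == target).all())

policy = TinyPolicy()
opt = torch.optim.Adam(policy.parameters(), lr=1e-2)

for step in range(200):
    # 1) On-policy rollout: sample a sequence from the *current* model only.
    tokens = torch.zeros(1, 1, dtype=torch.long)  # BOS-like start token
    log_probs = []
    for _ in range(SEQ_LEN):
        dist = torch.distributions.Categorical(logits=policy(tokens))
        tok = dist.sample()
        log_probs.append(dist.log_prob(tok))
        tokens = torch.cat([tokens, tok.unsqueeze(0)], dim=1)

    # 2) Score the model's own output; no SFT data or human demonstrations.
    r = reward_fn(tokens[0, 1:])

    # 3) REINFORCE update: reinforce whatever the model itself produced.
    loss = -r * torch.stack(log_probs).sum()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Note how the limitation mentioned above falls out of the loop's structure: every gradient comes from a sequence the model could already sample, so the training signal can only sharpen behaviours the base model was capable of producing in the first place.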
Another key characteristic, which may seem both redundant and self-evident, is that it cannot generate new knowledge.
I say "both redundant and self-evident" because I think it depends on how you define knowledge… not a can of worms I'll open here!