[zero-RL] what is it?
Zero-RL applies reinforcement learning (RL) directly to a base LM, eliciting its latent reasoning ability using the model's own rollouts. A fundamental limitation worth highlighting: it is inherently "on-policy", constraining learning exclusively to the model's self-generated outputs through iterative cycles of trials and feedback. Despite showing promising results, zero-RL is bounded by the capabilities of the base model itself.
A key characteristic is that an LLM can be trained this way without any Supervised Fine-Tuning (SFT) stage.
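To make the "on-policy, no SFT" loop concrete, here is a minimal sketch of the training cycle, under loud assumptions: a tiny toy policy (`TinyPolicy`) stands in for the base LM, `reward_fn` is a hypothetical verifiable reward, and a plain REINFORCE update stands in for whatever policy-gradient method a real zero-RL setup uses. The point is only the shape of the loop: sample rollouts from the current model, score them, and reinforce them, with no demonstration data anywhere.

```python
# Illustrative sketch of the zero-RL loop (assumptions noted above).
import torch
import torch.nn as nn

VOCAB, SEQ_LEN = 8, 6

class TinyPolicy(nn.Module):
    """Toy stand-in for a base LM: maps a running context to next-token logits."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, 32)
        self.rnn = nn.GRU(32, 32, batch_first=True)
        self.head = nn.Linear(32, VOCAB)

    def forward(self, tokens):
        h, _ = self.rnn(self.embed(tokens))
        return self.head(h[:, -1])  # logits for the next token

def reward_fn(rollout):
    # Hypothetical verifiable reward: +1 if the rollout matches a fixed target.
    target = torch.tensor([1, 2, 3, 4, 5, 6])
    return float((rollout == target).all())

policy = TinyPolicy()
opt = torch.optim.Adam(policy.parameters(), lr=1e-2)

for step in range(200):
    # 1) On-policy rollout: sample a sequence from the *current* model only.
    tokens = torch.zeros(1, 1, dtype=torch.long)  # BOS-like start token
    log_probs = []
    for _ in range(SEQ_LEN):
        dist = torch.distributions.Categorical(logits=policy(tokens))
        tok = dist.sample()
        log_probs.append(dist.log_prob(tok))
        tokens = torch.cat([tokens, tok.unsqueeze(0)], dim=1)

    # 2) Score the model's own output; no SFT data or human demonstrations.
    r = reward_fn(tokens[0, 1:])

    # 3) REINFORCE update: reinforce whatever the model itself produced.
    loss = -r * torch.stack(log_probs).sum()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Note how the limitation mentioned above falls out of the loop's structure: every gradient comes from a sequence the model could already sample, so the training signal can only sharpen behaviours the base model was capable of producing in the first place.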
Another key characteristic, which may seem both redundant and self-evident, is that it cannot generate new knowledge.
I say "both redundant and self-evident" because I think it depends on how you define knowledge… not a can of worms I'll open here!