[zero-RL] When you SFT a smaller LM on the reasoning traces of a larger LM

You are doing Imitation Learning (specifically Behavioral Cloning), because both the goal and the mechanism come down to mimicking the expert’s token sequences.

You are doing Transfer Learning (specifically Knowledge Distillation) because you are transferring reasoning knowledge from a teacher model to a student model.
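As a concrete illustration, here is a minimal sketch of one such training step, assuming a Hugging Face causal LM. The student checkpoint name and the (prompt, teacher trace) pair are placeholders; in practice the traces would be sampled offline from the larger teacher model.

```python
# Minimal SFT-on-teacher-traces sketch (behavioral cloning / hard-label
# distillation). Checkpoint name and example pair are hypothetical.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

student_name = "Qwen/Qwen2.5-0.5B"  # placeholder student checkpoint
tokenizer = AutoTokenizer.from_pretrained(student_name)
student = AutoModelForCausalLM.from_pretrained(student_name)
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-5)

# One (prompt, teacher_trace) pair; the trace was generated by the teacher.
prompt = "Q: What is 17 * 24? Reason step by step.\nA:"
teacher_trace = " 17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408."

ids = tokenizer(prompt + teacher_trace, return_tensors="pt").input_ids
labels = ids.clone()
# Mask the prompt tokens (-100 is ignored by the loss) so the
# cross-entropy covers only the teacher's trace.
prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
labels[:, :prompt_len] = -100

# Standard next-token cross-entropy: maximum likelihood on the
# teacher's token sequence.
loss = student(input_ids=ids, labels=labels).loss
loss.backward()
optimizer.step()
optimizer.zero_grad()
```

Note what the loop does not contain: no sampling from the student, no reward signal, no policy-gradient update. It is one likelihood-maximization step on fixed teacher data.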

You are not doing Off-Policy Reinforcement Learning, because the learning process is supervised likelihood maximization, not reward maximization driven by an RL algorithm.

Although the data itself is “off-policy” (not generated by the model being trained), the learning paradigm is supervised imitation, not RL.
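Schematically (the notation here is mine, not from the source), the two objectives differ as follows: behavioral cloning maximizes the likelihood of fixed teacher data, while off-policy RL maximizes expected reward under the student’s own policy, typically with an importance correction for the behavior policy $\mu$:

$$
\max_\theta \; \mathbb{E}_{(x,\,y)\sim \mathcal{D}_{\text{teacher}}}\!\left[\log p_\theta(y \mid x)\right]
\qquad \text{vs.} \qquad
\max_\theta \; \mathbb{E}_{x,\; y \sim \mu(\cdot \mid x)}\!\left[\frac{p_\theta(y \mid x)}{\mu(y \mid x)}\, R(x, y)\right]
$$

The left objective needs no reward $R$ and no ratio $p_\theta / \mu$, which is exactly why reusing off-policy data here does not make the procedure RL.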

Source: Off Policy “zero-RL” explained in simple terms