Deep Learning
Tuesday, April 29, 2025
Here’s a “standard” progression of training methodologies:
Pre-training - This is where the model gains broad knowledge, forming the foundation necessary for reasoning.
CPT (Continued Pre-training) - Makes the model knowledgeable about specific domains.
SFT (Supervised Fine-Tuning) - Makes the model skilled at specific tasks by leveraging knowledge it already has.
RL (Reinforcement Learning) - Uses methods like GRPO and DPO to align model behavior.
Reasoning traces play different roles at each stage:
Continue reading →
Tuesday, April 29, 2025
Source: Off Policy “zero RL” in simple terms
Results demonstrate that LUFFY encourages the model to imitate high-quality reasoning traces while maintaining exploration of its own sampling space.
Authors introduce policy shaping via regularized importance sampling, which amplifies learning signals for low-probability yet crucial actions under “off-policy” guidance.
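To get a feel for how "amplifying learning signals for low-probability actions" can work, here is a minimal sketch. It assumes a shaping function of the form f(p) = p / (p + gamma) applied to the model's probability p for an off-policy token. The function names and gamma = 0.1 are my own illustrative assumptions, not something confirmed by the quoted text. The slope of f tells us how strongly a token's gradient is weighted.

```python
def shape(p, gamma=0.1):
    """Hypothetical shaping function f(p) = p / (p + gamma) (an assumption)."""
    return p / (p + gamma)


def shape_slope(p, gamma=0.1):
    """Analytic derivative of shape with respect to p: gamma / (p + gamma)^2."""
    return gamma / (p + gamma) ** 2


def numeric_slope(p, gamma=0.1, h=1e-6):
    """Central finite difference, used only to sanity-check the derivative."""
    return (shape(p + h, gamma) - shape(p - h, gamma)) / (2 * h)


# Without shaping the weight on every token is 1 (the slope of f(p) = p).
# With shaping, rare tokens get a larger weight than confident ones.
for p in (0.01, 0.1, 0.5, 0.9):
    print(f"p={p:.2f}  shaped weight={shape_slope(p):.3f}")
```

Under this assumed form, a token the model currently finds unlikely gets a weight above 1, while a token it already predicts confidently gets a weight well below 1, which is the "amplify the rare but crucial actions" behavior described in the quote.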
The aspect that is still not clear to me is how there is any exploration of the solution space.
Continue reading →
Monday, April 28, 2025
Based on conventional zero-RL methods such as GRPO, LUFFY introduces off-policy reasoning traces (e.g., from DeepSeek-R1) and combines them with models' on-policy roll-outs before advantage computation.
… However, naively combining off-policy traces can lead to overly rapid convergence and entropy collapse, causing the model to latch onto superficial patterns rather than acquiring genuine reasoning capabilities.
…genuine reasoning capabilities… I am not certain whether the implication is that DeepSeek-R1 can reason or whether it is a reminder that no model can genuinely reason.
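The mixing step in the first quote (off-policy traces joined with on-policy roll-outs "before advantage computation") can be sketched as follows. This assumes the usual GRPO-style group normalization, where each reward is centered and scaled by the group's mean and standard deviation. The function name, the reward values, and the `eps` constant are hypothetical, and the sketch deliberately leaves out the policy-shaping part.

```python
from statistics import mean, pstdev


def group_advantages(on_policy_rewards, off_policy_rewards, eps=1e-6):
    """Group-normalized advantages over one mixed group of rollouts.

    Assumption: on-policy and off-policy rewards are simply concatenated
    into a single group before normalization.
    """
    rewards = list(on_policy_rewards) + list(off_policy_rewards)
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every rollout gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Made-up example: four rollouts from the model (one correct) plus one
# correct trace from a stronger model.
on_policy = [0.0, 0.0, 1.0, 0.0]
off_policy = [1.0]
advantages = group_advantages(on_policy, off_policy)
print(advantages)
```

Even when the model's own rollouts mostly fail, the correct off-policy trace lands in the same group with a positive advantage, so there is always something to learn from. That is also why naive mixing could push the model to copy the trace too quickly, as the second quote warns.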
Continue reading →
Monday, April 28, 2025
Zero-RL applies reinforcement learning (RL) directly to a base LM, eliciting its reasoning potential using the model's own rollouts. A fundamental limitation worth highlighting: it is inherently “on-policy”, constraining learning exclusively to the model’s self-generated outputs through iterative trials and feedback cycles. Despite showing promising results, zero-RL is bounded by the base LLM itself.
A key characteristic is that an LLM can be trained without Supervised Fine-Tuning (SFT).
Continue reading →
Monday, April 28, 2025
You are doing Imitation Learning (specifically Behavioral Cloning) because the goal and mechanism involve mimicking the expert’s token sequences.
You are doing Transfer Learning (specifically Knowledge Distillation) because you are transferring reasoning knowledge from a teacher model to a student model.
You are not doing Off-Policy Reinforcement Learning because the learning process is supervised likelihood maximization, not reward maximization using RL algorithms.
Although the data itself is “off-policy” (not generated by the model being trained), the learning paradigm is supervised imitation, not RL.
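To make "supervised likelihood maximization" concrete, here is a minimal sketch of the quantity being minimized when a student imitates a teacher's trace. The function name and the probabilities are made up for illustration. The input is the probability the student assigns to each token of the expert's sequence.

```python
import math


def imitation_loss(expert_token_probs):
    """Mean negative log-likelihood of the expert's tokens under the student.

    There is no reward and no sampling from the student here, which is
    what separates this from RL: the target tokens are fixed in advance.
    """
    return -sum(math.log(p) for p in expert_token_probs) / len(expert_token_probs)


# Hypothetical per-token probabilities before and after some training.
before = [0.10, 0.05, 0.20]
after = [0.80, 0.70, 0.90]
print(imitation_loss(before), imitation_loss(after))
```

The loss only falls when the student puts more probability on the teacher's exact tokens, so the objective is mimicry of a fixed sequence rather than maximizing a reward over the student's own samples.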
Continue reading →
Saturday, April 26, 2025
Support Vector Machines (SVM) are a mathematical approach for classifying data by finding optimal separating hyperplanes, applicable even in non-linear scenarios using kernel methods.
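The kernel idea can be illustrated with a small sketch. This is not an SVM solver: the hyperplane below is picked by hand, and the XOR-style data, the degree-2 polynomial kernel, and all names are my own choices for illustration. It shows that the kernel value equals a dot product in a higher-dimensional feature space, and that data which no straight line separates in 2D becomes separable by a hyperplane there.

```python
import math


def poly_kernel(x, z):
    """Degree-2 polynomial kernel k(x, z) = (x . z)^2 for 2D inputs."""
    return (x[0] * z[0] + x[1] * z[1]) ** 2


def feature_map(x):
    """Explicit feature map whose dot product reproduces poly_kernel."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


# XOR-style labels: not linearly separable in the original 2D space.
points = [(-1, -1), (-1, 1), (1, -1), (1, 1)]
labels = [-1, 1, 1, -1]

# The kernel computes the feature-space dot product without building features.
kernel_matches = all(
    abs(poly_kernel(x, z) - dot(feature_map(x), feature_map(z))) < 1e-9
    for x in points
    for z in points
)

# A hand-picked hyperplane in feature space: the sign of the cross term x1*x2.
w = (0.0, -1.0, 0.0)
predictions = [1 if dot(w, feature_map(x)) > 0 else -1 for x in points]
print(kernel_matches, predictions == labels)
```

A real SVM would find the separating hyperplane by optimization and would only ever evaluate the kernel, never the explicit feature map. The map is written out here just to show what the kernel is implicitly doing.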
Continue reading →