Reinforcement Learning

[zero-RL] Summarising what LUFFY offers

Here’s a “standard” progression of training methodologies:

Pre-training - the model gains broad knowledge, forming the foundation necessary for reasoning.
CPT (Continued Pre-training) - makes the model knowledgeable about specific domains.
SFT (Supervised Fine-Tuning) - makes the model skilled at specific tasks by leveraging knowledge it already has.
RL (Reinforcement Learning) - aligns model behavior using methods like GRPO and DPO.

Reasoning traces play different roles at each stage:

Continue reading →

[zero-RL] where is the exploration?

Source: Off Policy “zero RL” in simple terms

Results demonstrate that LUFFY encourages the model to imitate high-quality reasoning traces while maintaining exploration of its own sampling space. The authors introduce policy shaping via regularized importance sampling, which amplifies the learning signal for low-probability yet crucial actions under “off-policy” guidance. The aspect that is still not clear to me is how there is any exploration of the solution space.
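To make “amplifies learning signals for low-probability actions” concrete, here is a minimal sketch. It assumes a shaping function of the form f(x) = x / (x + γ) applied to the probability the current policy assigns to each off-policy token; the exact function and constants LUFFY uses may differ, so take this only as an illustration of why rare-but-crucial tokens get a relatively larger gradient.

```python
# Assumed shaping function f(x) = x / (x + gamma) on the probability the
# current policy assigns to an off-policy token. The exact form in LUFFY
# may differ; this only illustrates the amplification effect.

def shaped_weight(p_token, gamma=0.1):
    return p_token / (p_token + gamma)

def gradient_scale(p_token, gamma=0.1):
    # Derivative of the shaped weight w.r.t. p_token: largest where the
    # token is least probable, i.e. the learning signal is amplified there.
    return gamma / (p_token + gamma) ** 2

for p in (0.01, 0.1, 0.5, 0.9):
    print(f"p={p:4.2f}  shaped={shaped_weight(p):.3f}  grad_scale={gradient_scale(p):.2f}")
```

The gradient scale is largest exactly where the policy currently assigns the token little probability, which is the claimed amplification effect.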

Continue reading →

[zero-RL] LUFFY: Learning to reason Under oFF policY guidance

Based on conventional zero-RL methods such as GRPO, LUFFY introduces off-policy reasoning traces (e.g., from DeepSeek-R1) and combines them with the model's on-policy roll-outs before advantage computation. … However, naively combining off-policy traces can lead to overly rapid convergence and entropy collapse, causing the model to latch onto superficial patterns rather than acquiring genuine reasoning capabilities. …genuine reasoning capabilities… I am not certain whether the implication is that DeepSeek-R1 can reason, or that it is a reminder that no model can genuinely reason.
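My reading of the mixing step, sketched below: the external traces are pooled with the model's own rollouts for the same prompt, and advantages are computed over the combined group in the GRPO style (reward minus group mean, divided by group standard deviation). The reward values and the helper name group_advantages are illustrative, not from the paper.

```python
import statistics

def group_advantages(rewards):
    """GRPO-style advantages: normalise rewards within the group."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0   # guard against an all-equal group
    return [(r - mean) / std for r in rewards]

on_policy_rewards = [0.0, 0.0, 1.0, 0.0]   # the model's own rollouts, mostly failing
off_policy_rewards = [1.0, 1.0]            # high-quality external traces (e.g. from R1)

advantages = group_advantages(on_policy_rewards + off_policy_rewards)
print(advantages)   # the external traces end up with positive advantages to learn from
```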

Continue reading →

[zero-RL] what is it?

Zero-RL applies reinforcement learning (RL) directly to a base LM, eliciting its reasoning potential using the model's own rollouts. A fundamental limitation worth highlighting: it is inherently “on-policy”, constraining learning exclusively to the model’s self-generated outputs through iterative trials and feedback cycles. Despite showing promising results, zero-RL is bounded by the base LLM itself. A key characteristic is that an LLM can be trained this way without Supervised Fine-Tuning (SFT).
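A toy illustration of that “bounded by the base LLM itself” point. The policy here is a trivial two-option categorical distribution rather than an LLM, and the update is a crude REINFORCE-style nudge; the only thing it is meant to show is that behaviour the model never samples can never be reinforced.

```python
import random

# Base model's tendencies and a verifiable reward: only "answer_B" is correct,
# but the base model almost never produces it.
policy = {"answer_A": 0.98, "answer_B": 0.02}
reward = {"answer_A": 0.0, "answer_B": 1.0}

def sample(p):
    return random.choices(list(p), weights=list(p.values()))[0]

lr = 0.1
for _ in range(200):
    a = sample(policy)                               # rollout comes from the model itself
    policy[a] += lr * reward[a] * (1 - policy[a])    # reinforce only what was actually sampled
    total = sum(policy.values())
    policy = {k: v / total for k, v in policy.items()}

print(policy)   # improves only as fast as "answer_B" happens to be sampled
```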

Continue reading →

A speculative recipe for useful agentic behaviours

define actions by Promise Theory
train multiple neural nets to classify an action for a given input (train them differently to spice things up)
take an environment for the agents to operate in (e.g. a 3D maze where collaboration is needed to escape)
bind the agents' interactions with a healthy dose of the wave collapse algorithm

Continue reading →

[RL Series 2/n] From Animals to Agents: Linking Psychology, Behaviour, Mathematics, and Decision Making

intro
Maths, computation, the mind, and related fields are a fascination for me. I had thought I was quite well informed, and to a large degree I did know most of the more traditional Computer Science (it was my undergraduate degree…). What had slipped by me was reinforcement learning, both its mathematical grounding and its value in application. If you’ve come from the previous post ([RL Series 1/n] Defining Artificial Intelligence and Reinforcement Learning) you’ll know I’ve said something like that already.

Continue reading →

[RL Series 1/n] Defining Artificial Intelligence and Reinforcement Learning

intro
I’m learning about Reinforcement Learning; it’s an area that holds a lot of intrigue for me. The first I recall hearing of it was when ChatGPT was released and it was said that Reinforcement Learning from Human Feedback was the key to making it so fluent in its responses. Since then I’ve been studying AI and Data Science for a Masters, so I’m stepping back to understand the domain in greater detail.

Continue reading →

What is Off-Policy learning?

I’ve recently dug into Temporal Difference algorithms for Reinforcement Learning. The field of study has been a ride, from Animals in the late 1890s to Control Theory, Agents, and back to Animals in the 1990s (and on). It’s culminated in me developing a Q-Learning agent, and learning about hyperparameter sweeps and statistical significance, all relevant to the efficiency of off-policy learning but topics for another day. I write this as it took a moment for me to realise what off-policy learning actually is.
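For reference, the off-policy detail sits in the update rule itself: the max over next-state actions means Q-Learning evaluates the greedy target policy even though the transitions can come from an exploratory (e.g. epsilon-greedy) behaviour policy. SARSA, by contrast, would use the value of the action actually taken next. A minimal sketch:

```python
def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One Q-Learning step on a transition (s, a, r, s_next)."""
    td_target = r + gamma * max(Q[s_next].values())   # greedy w.r.t. the *target* policy
    td_error = td_target - Q[s][a]
    Q[s][a] += alpha * td_error
    return td_error

# The transition below could have been produced by any behaviour policy
# (epsilon-greedy, random, even a logged dataset) -- the update does not care.
Q = {"s0": {"left": 0.0, "right": 0.0}, "s1": {"left": 0.0, "right": 2.0}}
q_learning_update(Q, "s0", "left", 1.0, "s1")
print(Q["s0"])   # {'left': ~0.298, 'right': 0.0}
```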

Continue reading →

Dopamine as temporal difference errors !! 🤯

I expect I’m sharing a dopamine burst that I experienced! 🤓 I’m listening to The Alignment Problem by Brian Christian 📚 and it’s explaining how Dayan, Montague, and Sejnowski* connected Wolfram Schultz’s work to the Temporal Difference algorithm (iirc that’s, of course!, from Sutton and Barto). A quick search returns these to add to my maybe-reading list:

Dopamine and Temporal Differences Learning (Montague, Dayan & Sejnowski, 1996)
Dopamine and temporal difference learning: A fruitful relationship between neuroscience and AI (DeepMind, 2020)
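For anyone who wants the formula behind the excitement: the dopamine signal is mapped onto the TD error, δ = r + γV(s′) − V(s), the gap between what just happened (plus the new prediction of the future) and what was predicted. A tiny TD(0) sketch with made-up state names, purely to show the quantity:

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.99):
    """One TD(0) step; returns the TD error (the 'reward prediction error')."""
    delta = r + gamma * V[s_next] - V[s]
    V[s] += alpha * delta
    return delta

V = {"cue": 0.0, "reward_time": 0.0}               # made-up states
print(td0_update(V, "cue", 1.0, "reward_time"))    # unexpected reward -> large positive error
print(td0_update(V, "cue", 1.0, "reward_time"))    # repeat it and the error shrinks as V("cue") learns
```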

Continue reading →