Notes and links on SVMs (WIP)
Saturday, April 26, 2025
Support Vector Machines (SVM) are a mathematical approach for classifying data by finding optimal separating hyperplanes, applicable even in non-linear scenarios using kernel methods.
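A minimal sketch of that idea in scikit-learn (an illustrative example, not code from the post), using an RBF kernel on a dataset no straight line can separate:

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Two interleaved half-moons: not linearly separable in 2D
X, y = make_moons(n_samples=200, noise=0.1, random_state=0)

# The RBF kernel implicitly maps the data into a higher-dimensional
# space where a separating hyperplane does exist
clf = SVC(kernel="rbf", C=1.0).fit(X, y)
print(clf.score(X, y))
```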
Thursday, April 24, 2025
The document discusses various search algorithms used by Intelligent Agents for navigating mazes, detailing their types, characteristics, tradeoffs, and implementations.
Thursday, April 24, 2025
This text introduces key concepts and algorithms related to intelligent agents in AI, focusing on search terminology, uninformed and informed search strategies, and adversarial search techniques.
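As an illustrative sketch of one of the uninformed strategies both posts cover, here is breadth-first search on a grid maze (the maze representation is an assumption, not the posts' code):

```python
from collections import deque

def bfs(maze, start, goal):
    """Breadth-first search over a grid maze.

    maze: 2D list where 0 is open and 1 is a wall.
    Returns the shortest path length, or None if unreachable.
    """
    rows, cols = len(maze), len(maze[0])
    frontier = deque([(start, 0)])
    visited = {start}
    while frontier:
        (r, c), dist = frontier.popleft()
        if (r, c) == goal:
            return dist
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols
                    and maze[nr][nc] == 0 and (nr, nc) not in visited):
                visited.add((nr, nc))
                frontier.append(((nr, nc), dist + 1))
    return None
```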
“But who was learning, you or the machine?”
“Well, I suppose we both were.”
Amazing book
#TheAlignmentProblem #Learning #ResponsibleAI
The Alignment Problem by Brian Christian
Monday, March 17, 2025
Regularisation is known to reduce overfitting when training a neural network. As with many of these techniques there is a rich background and many options available, so asking why and how it works opens up a lot of information. Digging through it, for me at least, it wasn’t clear why or how it reduced overfitting until I reframed what it was doing. In short, regularisation changes the sensitivity of the model to the training data.
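As a hedged illustration of that reframing, here is a minimal sketch assuming L2 regularisation (weight decay), the most common form; the names and lambda value are illustrative, not from the post:

```python
import numpy as np

def l2_regularised_loss(y, y_hat, w, lam=0.01):
    """Mean squared error plus an L2 penalty on the weights.

    The lam * ||w||^2 term punishes large weights, which damps how
    strongly the model can respond to any individual training example.
    """
    mse = np.mean((y - y_hat) ** 2)
    return mse + lam * np.sum(w ** 2)

# The penalty's gradient is 2 * lam * w, so each gradient step also
# shrinks the weights toward zero ("weight decay"):
#   w -= eta * (grad_mse + 2 * lam * w)
```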
Thursday, March 6, 2025
This is an interesting one, as I’d thought it was quite academic, with limited utility. Then I saw these graphs of error per epoch. The first shows the error per epoch when training a model on the data as-is: it takes around 180-200 epochs to train with a learning rate (eta) of 0.0002 or lower. Now compare it to the second, where the training takes around 15 epochs with a learning rate of 0.…
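Assuming the technique being compared is feature standardisation (z-score scaling), which produces exactly this kind of speed-up, a minimal sketch:

```python
import numpy as np

def standardise(X):
    """Scale each feature column to zero mean and unit variance.

    On standardised data the loss surface is better conditioned, so
    gradient descent tolerates a much larger learning rate, which is
    what cuts the epochs needed to converge.
    """
    return (X - X.mean(axis=0)) / X.std(axis=0)
```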
Tuesday, February 25, 2025
Next I’m looking at the Adaline in Python code. This post is a mixture of what I’ve learnt in my degree, Sebastian Raschka’s book/code, and the 1960 paper that delivered the Adaline Neuron. On the difference between the Perceptron and the Adaline: in the first post we looked at the Perceptron as a flow of inputs (x), multiplied by weights (w), then summed in the Aggregation Function and finally quantised in the Threshold Function.
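A minimal sketch of that flow (illustrative, based on the standard formulation rather than the post's code):

```python
import numpy as np

def net_input(X, w, b):
    # Aggregation Function: weighted sum of inputs plus bias
    return X @ w + b

def predict(X, w, b):
    # Threshold Function: quantise the net input to a class label
    return np.where(net_input(X, w, b) >= 0.0, 1, -1)

# The Adaline predicts through the same threshold, but its weight
# updates are driven by the continuous net input (a linear
# activation), minimising a cost function rather than reacting
# only to misclassified samples as the Perceptron does.
```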
Monday, February 24, 2025
The text discusses the development and significance of the Adaline artificial neuron, highlighting its introduction of non-linear activation functions and cost minimization, which have important implications for modern machine learning.
Sunday, February 23, 2025
The author shares personal reflections on self-kindness and positive thinking as tools for finding peace amid societal challenges.
Wednesday, February 12, 2025
This post looks at the Perceptron, from Frank Rosenblatt’s original paper to a practical implementation classifying Iris flowers. The Perceptron is the original Artificial Neuron and provided a way to train a model to classify linearly separable data sets. The Perceptron itself had a short life, with the Adaline coming three years later. However, its name lives on in neural networks as the Multilayer Perceptron (MLP). The naming shows the importance of this discovery.
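A minimal sketch of the classic Perceptron learning rule (illustrative, not the post's implementation): on each misclassified sample, nudge the weights toward the correct side of the hyperplane.

```python
import numpy as np

def train_perceptron(X, y, eta=0.1, epochs=10):
    """Classic Perceptron rule for labels in {-1, +1}.

    Guaranteed to converge only if the data are linearly separable.
    """
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            pred = 1 if xi @ w + b >= 0.0 else -1
            update = eta * (yi - pred)  # zero when already correct
            w += update * xi
            b += update
    return w, b
```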
Friday, February 7, 2025
Maths, computation, the mind, and related fields are a fascination for me. I had thought I was quite well informed, and to a large degree I did know most of the science in more traditional Computer Science (it was my undergraduate degree…). What had slipped me by was reinforcement learning, both its mathematical grounding and the value of its applications. If you’ve come from the previous post ([RL Series 1/n] Defining Artificial Intelligence and Reinforcement Learning) you know I’ve said something like that already.
Friday, January 31, 2025
I’m learning about Reinforcement Learning; it’s an area that holds a lot of intrigue for me. The first I recall hearing of it was when ChatGPT was released, and Reinforcement Learning from Human Feedback was said to be the key to making it so fluent in its responses. I’m now studying AI and Data Science for a Masters, so I’m stepping back to understand the domain in greater detail.
Friday, January 31, 2025
I’ve recently dug into Temporal Difference algorithms for Reinforcement Learning. The field’s history has been a ride, from animals in the late 1890s to control theory, agents, and back to animals in the 1990s (and on). It’s culminated in me developing a Q-Learning agent, and learning about hyperparameter sweeps and statistical significance, all relevant to the efficiency of off-policy learning but topics for another day. I write this as it took a moment for me to realise what off-policy learning actually is.
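A minimal sketch of the Q-Learning update (the standard Sutton and Barto form, not the post's code), which shows where "off-policy" comes in: the target takes the max over next actions, regardless of which action the behaviour policy actually takes next.

```python
def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One off-policy TD update; Q is a dict of {state: {action: value}}.

    The target r + gamma * max_a' Q(s', a') evaluates the greedy
    policy, even if actions are being chosen epsilon-greedily.
    """
    target = r + gamma * max(Q[s_next].values())
    Q[s][a] += alpha * (target - Q[s][a])
```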
Wednesday, January 15, 2025
I expect I’m sharing a dopamine burst that I experienced! I’m listening to The Alignment Problem by Brian Christian and it’s explaining how Dayan, Montague, and Sejnowski connected Wolfram Schultz’s work to the Temporal Difference algorithm (which is, of course, from Sutton and Barto, iirc). A quick search returns these to add to my maybe-reading list: Dopamine and Temporal Differences Learning (Montague, Dayan & Sejnowski, 1996); Dopamine and temporal difference learning: A fruitful relationship between neuroscience and AI (DeepMind, 2020).