Zero-RL applies reinforcement learning (RL) directly to a base LLM, eliciting its reasoning potential from the model’s own rollouts. A fundamental limitation worth highlighting: it is inherently “on-policy”, constraining learning exclusively to the model’s self-generated outputs through iterative trials and feedback cycles. Despite showing promising results, zero-RL is bounded by the base LLM itself.
A key characteristic is that an LLM can be trained without Supervised Fine-Tuning (SFT).
You are doing Imitation Learning (specifically Behavioral Cloning) because the goal and mechanism involve mimicking the expert’s token sequences.
You are doing Transfer Learning (specifically Knowledge Distillation) because you are transferring reasoning knowledge from a teacher model to a student model.
You are not doing Off-Policy Reinforcement Learning because the learning process is supervised likelihood maximization, not reward maximization using RL algorithms.
Although the data itself is “off-policy” (not generated by the model being trained), the learning paradigm is supervised imitation, not RL.
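To make the distinction concrete, here is a minimal sketch for a toy softmax policy over three tokens. The function names and the toy setup are illustrative assumptions, not taken from any particular training library. The supervised (behavioural cloning) update pushes probability towards the expert’s token regardless of any reward, while a REINFORCE-style update scales the same kind of gradient by the reward earned by a token the model sampled itself.

```python
import math


def softmax(logits):
    """Convert raw logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sft_grad(logits, expert_token):
    """Gradient of -log p(expert_token) w.r.t. the logits.

    This is supervised likelihood maximisation: the target token comes
    from the teacher's data and no reward is involved.
    """
    probs = softmax(logits)
    return [p - (1.0 if i == expert_token else 0.0) for i, p in enumerate(probs)]


def reinforce_grad(logits, sampled_token, reward):
    """Gradient of -reward * log p(sampled_token) w.r.t. the logits.

    This is a REINFORCE-style signal: the token is assumed to have been
    sampled from the policy and the update is weighted by its reward.
    """
    probs = softmax(logits)
    return [reward * (p - (1.0 if i == sampled_token else 0.0))
            for i, p in enumerate(probs)]


def sgd_step(logits, grad, lr=0.5):
    """One plain gradient-descent step on the logits."""
    return [z - lr * g for z, g in zip(logits, grad)]
```

Note that a sampled token with zero reward produces no update at all in the RL case, whereas the supervised case always moves towards the expert token.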
Support Vector Machines (SVM) are a mathematical approach for classifying data by finding optimal separating hyperplanes, applicable even in non-linear scenarios using kernel methods.
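As a rough illustration of the kernel idea, the sketch below evaluates an SVM decision function in its dual form on the XOR problem, which no straight line can separate. The RBF kernel, the value of `GAMMA`, and the hand-derived coefficient `ALPHA` are assumptions chosen for this toy example (so that every point sits exactly on the margin); a real SVM would find the coefficients with an optimiser.

```python
import math


def rbf_kernel(a, b, gamma=1.0):
    """Radial basis function kernel: exp(-gamma * ||a - b||^2)."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-gamma * sq_dist)


# XOR: not linearly separable in the original 2D space.
X = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0), (1.0, 0.0)]
Y = [-1, -1, 1, 1]

GAMMA = 1.0
# Hand-derived for this symmetric toy problem (an assumption of the sketch):
# the same dual coefficient for every point, with zero bias.
ALPHA = 1.0 / (1.0 - math.exp(-GAMMA)) ** 2
BIAS = 0.0


def decision_value(x):
    """Dual-form decision function: sum_i alpha_i * y_i * K(x_i, x) + b."""
    return sum(ALPHA * y_i * rbf_kernel(x_i, x, GAMMA)
               for x_i, y_i in zip(X, Y)) + BIAS


def predict(x):
    """Classify by the sign of the decision value."""
    return 1 if decision_value(x) >= 0 else -1
```

The hyperplane lives in the feature space implied by the kernel, which is why the boundary can be curved in the original coordinates.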
The document discusses various search algorithms used by Intelligent Agents for navigating mazes, detailing their types, characteristics, tradeoffs, and implementations.
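For a flavour of what such a search looks like, here is a minimal sketch of breadth-first search, one of the standard uninformed strategies, on a small grid maze. The maze layout, the `#` wall convention, and the function name are assumptions made for this example rather than details from the post.

```python
from collections import deque


def bfs_shortest_path(maze, start, goal):
    """Return the shortest path from start to goal as a list of (row, col)
    cells, or None if the goal cannot be reached. '#' marks a wall."""
    rows, cols = len(maze), len(maze[0])
    frontier = deque([start])      # FIFO queue: explore nearest cells first
    came_from = {start: None}      # also serves as the visited set
    while frontier:
        cell = frontier.popleft()
        if cell == goal:
            # Walk back through the parents to rebuild the path.
            path = []
            while cell is not None:
                path.append(cell)
                cell = came_from[cell]
            return path[::-1]
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and maze[nr][nc] != "#" and (nr, nc) not in came_from:
                came_from[(nr, nc)] = cell
                frontier.append((nr, nc))
    return None


MAZE = [
    "S.#",
    ".##",
    "..G",
]
```

Calling `bfs_shortest_path(MAZE, (0, 0), (2, 2))` explores the grid level by level, which is what guarantees the shortest path when every step costs the same; the tradeoff is that the frontier can grow large on open mazes.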
This text introduces key concepts and algorithms related to intelligent agents in AI, focusing on search terms, uninformed and informed search strategies, and adversarial search techniques.
This post is inspired by a conversation with a fellow Data Science and AI student. It grew out of that conversation and was co-authored with Claude. Hope it’s useful!
Where to begin? When you’re starting your data science journey with Python, one of the first roadblocks you’ll encounter is package management. If you’ve tried conda and found it frustrating (as many do), you’re not alone. Let’s explore two modern tools that make Python package management simpler and more reliable: pipx and uv.
The text reflects on the misuse of technology and ethics in Silicon Valley, highlighting the importance of awareness and compassion amidst current challenges.
Great podcast where she talks about why America is broken.
Ideological sabotage (which surprised me, I’d thought the initial perpetrators would have done it for money) of the scientific method to protect the “right of freedom” has done exactly the opposite…
It really feels like some Americans are stuck fighting foes that no longer exist: either British tyranny and taxation without representation, or the war between capitalism and communism.
During the first phase of the project, the robots will be trained with approximately 45 atomic skills such as grasping, picking, placing and transporting.
A single action may need to be repeated up to 600 times a day by a data collector for the robots to learn from.
10 key scenarios, including industrial, domestic, and tourism services
The project expects to collect over 10 million real-machine data entries within the year.
The analysis concludes that while the EU AI Act does not obstruct startups, it presents both challenges and opportunities for innovation within a complex regulatory landscape.
I hope people can rally around and stop the buffoons soon. 💪🏼
#BeingHuman
The BBC News article is very clear…
The Russian president has given the US leader just enough to claim that he made progress towards peace in Ukraine, without making it look like he was played by the Kremlin.
There’s a lot of change at the moment: my feed is all about foreign policies, US government cuts, AI writing all code, and now parenting adolescents.
I’ve been experiencing a high level of uncertainty about Europe’s place in the world, mainly what decisions will be made after the US made their policy clear.
Though there’s one area where my uncertainty is decreasing: AI-generated code. It won’t replace software engineers.
Regularisation is known to reduce overfitting when training a neural network. As with a lot of these techniques, there is a rich background and many options available, so asking why and how it works opens up a lot of information. Digging through it, it wasn’t clear to me why or how regularisation reduces overfitting until I reframed what it was doing.
In short, regularisation changes the sensitivity of the model to the training data.
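One way to see this is with the simplest possible model rather than a neural network. The sketch below is an assumption-laden toy: a one-parameter linear fit with an L2 penalty (ridge regression), where the fitted weight has a closed form. It measures how far the weight moves when a single training label is corrupted, with and without the penalty. The data and the function names are made up for illustration.

```python
def ridge_weight(xs, ys, lam):
    """Closed-form minimiser of sum((w * x - y)^2) + lam * w^2 over w."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


def weight_shift(xs, ys, index, new_y, lam):
    """How much the fitted weight changes when one label is replaced."""
    perturbed = list(ys)
    perturbed[index] = new_y
    return abs(ridge_weight(xs, perturbed, lam) - ridge_weight(xs, ys, lam))


# Toy data lying exactly on y = x; the last label is then corrupted to 6.
XS = [1.0, 2.0, 3.0]
YS = [1.0, 2.0, 3.0]

shift_plain = weight_shift(XS, YS, index=2, new_y=6.0, lam=0.0)
shift_ridge = weight_shift(XS, YS, index=2, new_y=6.0, lam=14.0)
```

With the penalty switched on, the same corrupted label moves the weight less, at the cost of a fit that is biased towards zero even on clean data. In this toy case the shift works out to 9 / (14 + lam), so a larger penalty means lower sensitivity.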
define actions by Promise Theory; train multiple neural nets to classify an action for a given input (train them differently to spice things up); take an environment for the agents to operate in (e.g. a 3D maze where collaboration is needed to escape); bind the agents’ interactions with a healthy dose of the wave collapse algorithm
(I forget exactly but I’m pretty sure this is from an Alan Watts lecture).
A farmer needs some help around his farm. He puts up a sign in town, asking for someone with general skills to help around the farm.
A gent arrives two days later with his toolkit. The farmer welcomes him and tells him there are some broken fences on the north side of his farm.
The gent heads over to the area, spends the day fixing the fences, and comes back in the evening to tell the farmer it’s done.