What is Off-Policy learning?

I’ve recently dug into Temporal Difference algorithms for Reinforcement Learning. The field’s history has been a ride, from animal experiments in the late 1890s to control theory and agents, and back to animals in the 1990s (and on).

It’s culminated in me developing a Q-Learning agent and learning about hyperparameter sweeps and statistical significance, all relevant to the efficiency of off-policy learning but topics for another day.

I’m writing this because it took me a moment to realise what off-policy learning actually is. It was the attached graph that made it fully click.

Maybe, like me, you thought that training would result in the agent actually taking the optimal path by the end.

It doesn’t. The reason is that the agent always takes a random, exploratory action some of the time. Once it has explored enough, a random detour is quickly corrected and the agent returns to the optimal path, but the trajectory it actually follows is still not entirely optimal.

That’s the value of off-policy learning: the agent learns by exploring, but what it stores are value estimates for the greedy policy, ready for use when they’re needed.
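To make that concrete, here’s a minimal sketch of the mechanics (the names like `q_table` and the grid size are my own placeholders, not the exact code from my experiment): the action the agent *takes* comes from an ε-greedy behaviour policy, but the value it *learns* bootstraps from the greedy max over the next state’s actions. That gap between how it behaves and what it learns is exactly what “off-policy” means.

```python
import numpy as np

rng = np.random.default_rng(0)

n_states, n_actions = 48, 4            # placeholder sizes for a small grid world
alpha, gamma, epsilon = 0.1, 0.99, 0.1  # assumed hyperparameters
q_table = np.zeros((n_states, n_actions))

def behaviour_action(state):
    """Epsilon-greedy behaviour policy: mostly greedy, always a chance of random."""
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))   # exploratory (random) action
    return int(np.argmax(q_table[state]))     # greedy action

def q_learning_update(state, action, reward, next_state, done):
    """Off-policy update: the target uses max_a Q(s', a), the greedy policy,
    even if the action actually taken was a random one."""
    bootstrap = 0.0 if done else gamma * np.max(q_table[next_state])
    td_error = reward + bootstrap - q_table[state, action]
    q_table[state, action] += alpha * td_error
```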

I created the graphs below by running two agents in tandem: one doing the off-policy learning and one following up with an on-policy execution of the learnt policy. The first three graphs are metrics from the first agent, and the last graph compares the second agent’s path to the optimal one. Notice how the first agent gets good and starts receiving a positive reward before the second agent has a consistently optimal policy to follow. There’s a slight delay.
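For anyone who wants to see the shape of that experiment in code, here’s a toy sketch under my own assumptions (a tiny corridor environment standing in for the real one, made-up hyperparameters): the first loop is the off-policy learner behaving ε-greedily, and the second pass is the “second agent” following the learnt policy greedily.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy corridor: states 0..5, action 0 = left, 1 = right.
# Reaching state 5 pays +1 and ends the episode; every other step costs 0.01.
N, GOAL = 6, 5

def step(state, action):
    next_state = min(state + 1, GOAL) if action == 1 else max(state - 1, 0)
    reward = 1.0 if next_state == GOAL else -0.01
    return next_state, reward, next_state == GOAL

alpha, gamma, epsilon = 0.5, 0.99, 0.2
q = np.zeros((N, 2))

# Agent 1: off-policy learner, behaves epsilon-greedily but learns greedy values.
for _ in range(200):
    s, done = 0, False
    while not done:
        a = int(rng.integers(2)) if rng.random() < epsilon else int(np.argmax(q[s]))
        s2, r, done = step(s, a)
        target = r + (0.0 if done else gamma * np.max(q[s2]))
        q[s, a] += alpha * (target - q[s, a])
        s = s2

# Agent 2: executes the learnt policy greedily, with no exploration at all.
s, done, path = 0, False, [0]
for _ in range(20):                      # step cap in case the policy loops
    if done:
        break
    s, _, done = step(s, int(np.argmax(q[s])))
    path.append(s)
print("greedy path:", path)              # expect 0 -> 1 -> 2 -> 3 -> 4 -> 5
```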

So that’s the key takeaway for me: off-policy learning is a training-only algorithm with an element of random, spontaneous exploration. Once trained, the agent uses the learnt policy and follows an on-policy algorithm for execution.

If anyone is reading this and found it useful or has questions let me know 🤓✌🏼