Q-Learning (Off-Policy TD Control)

One of the most important breakthroughs in reinforcement learning was the development of an off-policy TD control algorithm known as Q-learning (Watkins, 1989).

Q-learning is a form of reinforcement learning that assigns values to state-action pairs. The state of the organism is the sum of all its sensory input, including its body position, its location in the environment, the neural activity in its head, etc. Because each state offers a number of possible actions, each action within each state has a value that reflects how much or how little reward the organism will get for completing that action (and reward means survival).

Q-learning does not need a model of its environment and can be used online, which makes it a good example of a model-free algorithm.

The state-value functions used by AHC can be adapted to state-action value functions, as was done in the Sarsa algorithm.

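The simplest form of Q-Learning, one-step Q-Learning, can be defined as follows, where α is the step-size (learning-rate) parameter and γ is the discount factor:

  • $Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right]$

This can also be presented as:

  • $Q(s_t, a_t) \leftarrow (1 - \alpha) Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) \right]$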
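The two forms of the update can be checked against each other with a minimal sketch. It assumes a tabular Q stored in a Python dictionary; the function names, the toy states `s0` and `s1`, and the parameter values are illustrative and not taken from the source.

```python
from collections import defaultdict

def q_update(Q, s, a, r, s_next, actions, alpha=0.5, gamma=0.9):
    """One-step Q-learning update written in the incremental (TD-error) form."""
    best_next = max(Q[(s_next, b)] for b in actions)  # max over a of Q(s_{t+1}, a)
    td_error = r + gamma * best_next - Q[(s, a)]
    Q[(s, a)] += alpha * td_error
    return Q[(s, a)]

def q_update_blend(Q, s, a, r, s_next, actions, alpha=0.5, gamma=0.9):
    """The same update written as a weighted average of old value and target."""
    best_next = max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + gamma * best_next)
    return Q[(s, a)]

# Illustrative data: two actions, one known value in the successor state.
actions = ["left", "right"]
Q1 = defaultdict(float)
Q1[("s1", "right")] = 2.0
Q2 = defaultdict(float, Q1)  # independent copy for the second form

v1 = q_update(Q1, "s0", "right", 1.0, "s1", actions)
v2 = q_update_blend(Q2, "s0", "right", 1.0, "s1", actions)
print(v1, v2)
```

Both calls start from Q(s0, right) = 0 and use the target 1 + 0.9 × 2 = 2.8, so each moves the estimate halfway to the target (about 1.4).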

In this case Q directly approximates Q*, independent of the policy being followed.

  • Q: The learned action-value function
  • Q*: The optimal action-value function
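The independence from the behaviour policy can be illustrated with a small sketch. The agent below acts epsilon-greedily, yet each update bootstraps from the maximum over next actions, so the learned Q still tracks the greedy (optimal) values. The five-state corridor, the `step` function, and all parameter values are hypothetical choices made for this example; the agent never inspects the environment's rules, it only observes the transitions it experiences.

```python
import random

N_STATES = 5          # states 0..4 on a line; state 4 is the goal
GOAL = N_STATES - 1
ACTIONS = (0, 1)      # 0 = step left, 1 = step right

def step(state, action):
    """Hypothetical environment: the agent only ever sees (next_state, reward)."""
    nxt = max(state - 1, 0) if action == 0 else state + 1
    reward = 1.0 if nxt == GOAL else 0.0
    return nxt, reward

def choose_action(Q, state, epsilon, rng):
    """Epsilon-greedy behaviour policy with random tie-breaking."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    best = max(Q[state])
    return rng.choice([a for a in ACTIONS if Q[state][a] == best])

def train(episodes=200, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(N_STATES)]  # Q[state][action]; goal row stays 0
    for _ in range(episodes):
        s = 0
        while s != GOAL:
            a = choose_action(Q, s, epsilon, rng)
            s_next, r = step(s, a)
            # Online update from one observed transition; no model is consulted.
            Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])
            s = s_next
    return Q

Q = train()
# Greedy action in each non-goal state according to the learned table.
greedy = [max(ACTIONS, key=lambda a: Q[s][a]) for s in range(GOAL)]
print(greedy)
```

Although a fifth of the agent's moves are random, the greedy policy read off the table steps right in every non-goal state, which is the shortest route to the goal in this corridor.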

Terminology

  • Q-Value: $Q^*(s, a) = R(s, a) + \gamma V^*(\delta(s, a))$, where $R(s, a)$ is the immediate reward and $\delta(s, a)$ is the state reached by taking action a in state s. This is the expected discounted reinforcement of taking action a in state s, then continuing to follow the policy of choosing actions that maximise Q*.
  • Optimal Policy: $\pi^*(s) = \arg\max_{a} Q^*(s, a)$
  • Absorbing Goal: A highly rewarding self-referential loop that results in the agent staying in its place forever.
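The three terms above can be tied together in a small worked sketch. It is not Q-learning itself: it assumes the transition function δ and reward function R are known, and simply iterates the Q-value definition until it settles. The three-state world, the reward of 10, and every name in the code are hypothetical.

```python
GAMMA = 0.9
STATES = ["A", "B", "G"]
ACTIONS = ["left", "right"]

# Hypothetical deterministic transition function delta(s, a).
DELTA = {
    ("A", "left"): "A", ("A", "right"): "B",
    ("B", "left"): "A", ("B", "right"): "G",
    ("G", "left"): "G", ("G", "right"): "G",  # absorbing: every action loops back to G
}

def reward(s, a):
    """Hypothetical R(s, a): 10 whenever the move lands in G, otherwise 0."""
    return 10.0 if DELTA[(s, a)] == "G" else 0.0

def solve_q(sweeps=500):
    """Iterate Q(s, a) = R(s, a) + gamma * V(delta(s, a)), with V(s) = max_a Q(s, a)."""
    Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
    for _ in range(sweeps):
        V = {s: max(Q[(s, a)] for a in ACTIONS) for s in STATES}
        Q = {(s, a): reward(s, a) + GAMMA * V[DELTA[(s, a)]]
             for s in STATES for a in ACTIONS}
    return Q

def optimal_policy(Q):
    """pi*(s) = argmax_a Q*(s, a); ties go to the first action listed."""
    return {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in STATES}

Q_star = solve_q()
pi_star = optimal_policy(Q_star)
print(pi_star)
```

In this toy world the loop at G pays 10 on every step, so its discounted value approaches 10 / (1 - 0.9) = 100. The optimal policy heads right from A and B, and once in G every action keeps the agent there.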

One Step Q-Learning

 <<<— TODO —>>> 

