Algorithm/Learning/TDL

Temporal Difference (TD) Learning

If one had to identify one idea as central and novel to reinforcement learning, it would undoubtedly be temporal-difference (TD) learning. TDL is a predication method, and mostly used in reinforcement learning, it is a combination of the following ideas...

  1. Monti Carlo (MC) ideas - It learns by sampling the environment according to some policy
  2. dynamic programming (DP) ideas - It approximates its current estimate based on previously learned estimates (also known as bootstrapping) without waiting for the final outcome.

The TDL algorithm is related to the TD model of animal learning.

Q-Learning and AHC are both instances of TD-Learning methods.


Comparing Temporal Difference to Predecessors

Temporal Difference (TD) vs Motnte Carlo (MC)

The main difference between MC and TD is that MC must wait until the final result before recalculating the estimates, but with TD, you will be able to learn immediately...

  • constant-α MC: V(st) ← V(st) + α[Rt - V(st)]  <<<— ASK: What is constant-α - is there a non-constant-α too? —>>> 
  • TD(0): V(st) ← V(st) + α[rt+1 +γV(st+1) - V(st)] (This is the simplest TD method, known as TD(0))

In effect, the target for the MC update is Rt, while the target for the TD update is rt+1 +γV(st+1). In english, this simply means that in TD learning, an agent would learn based on its current predictions, whereas in MC learning, the agent would have to wait until the termination at which time it would know the actual return. That is to say that TD methods learn their estimates in part on the basis of other estimates. For this reason, TD can operate in an online, fully-incremental fashion, the agent only needs to wait one time step before updating its estimates. With MC methods on the other hand, the agent must wait until the end of an episode, because only then is the return known; This means that MC is forced to perform offline, which turns out to be a critical consideration - some applications have very long episodes, so that delaying all learning until an episode's end is too slow.

Temporal Difference (TD) vs Dynamic Programming (DP)

Obviously, TD methods have an advantage over DP methods in that they do not require a model of the environment, of its reward and next-state probability distributions.


References

Ancstors ☣ AlgorithmAlgorithm/Learning
Sibblings ☣ Algorithm/Learning/DP?Algorithm/Learning/MC?
Other ☣ Agent/QLearning