Temporal Difference learning waits for only one interaction (one step) \(S_{t+1}\) to form a TD target and updates \(V(S_t)\) using \(R_{t+1}\) and \(\gamma \, V(S_{t+1})\).

The idea with TD is to update \(V(S_t)\) at each step.

But because we didn’t experience an entire episode, we don’t have \(G_t\) (the expected return). Instead, we estimate \(G_t\) by adding \(R_{t+1}\) and the discounted value of the next state.

This is called bootstrapping, because TD bases its update in part on an existing estimate \(V(S_{t+1})\) and not on a complete sample \(G_t\).
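
Written out in the notation used here, the TD(0) update rule is:

\[
V(S_t) \leftarrow V(S_t) + lr \, \big[ R_{t+1} + \gamma \, V(S_{t+1}) - V(S_t) \big]
\]

where \(R_{t+1} + \gamma \, V(S_{t+1})\) is the TD target and the bracketed difference is the TD error.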

Example



  • We just started training our value function, so it returns a value of 0 for each state.
  • Our learning rate (lr) is 0.1, and our discount rate is 1 (no discount).
  • Our mouse explores the environment and takes a random action: going to the left.
  • It gets a reward \(R_{t+1}=1\) since it eats a piece of cheese.
  • We can now update \(V(S_0)\):
  • New \(V(S_0) = V(S_0) + lr \times (R_{1} + \gamma \times V(S_1) - V(S_0)) = 0 + 0.1 \times (1 + 1 \times 0 - 0) = 0.1\)
  • So we just updated our value function for State 0. Now we continue to interact with the environment using our updated value function (a code sketch of this single update follows the list).
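
Here is a minimal Python sketch of this single TD(0) update. The state names (`S0`, `S1`), the dictionary-based value table, and the transition are assumptions chosen to mirror the example above, not part of any particular library.

```python
# TD(0) update for one step, mirroring the example above.
# State names and the tabular (dictionary) value function are assumptions for illustration.

lr = 0.1      # learning rate
gamma = 1.0   # discount rate (no discounting)

# Value function just initialized: returns 0 for every state
V = {"S0": 0.0, "S1": 0.0}

# One interaction: from S0 the mouse goes left, eats the cheese, and lands in S1
state, next_state, reward = "S0", "S1", 1.0

# TD target: R_{t+1} + gamma * V(S_{t+1})
td_target = reward + gamma * V[next_state]

# Update V(S_0) toward the TD target by the TD error
V[state] = V[state] + lr * (td_target - V[state])

print(V["S0"])  # 0.1, matching the hand calculation above
```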