Namely, that there is no point in taking different actions for a given state s at different times. So the equality says that you get the same value whether you optimize over policies pi that take the same action every time they see s, or over policies that may take a different action a at the first time step (when the state is s) and follow pi afterwards.

Intuitively, this follows from the fact that the MDP is stationary (i.e., its transition dynamics and rewards are the same regardless of which time step you are at). This can be shown formally, but the proof is not included in the current notes.

Stated differently: since the MDP is always the same, it doesn't make sense that it would be better to take a different action in the same state at a different time.
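You can also check this numerically. Below is a minimal sketch on a hypothetical toy MDP (two states, two actions; the transition matrix P and reward matrix R are made up for illustration). It runs value iteration to get the optimal value function V*, then verifies that allowing a different first action (taking the max over actions of the one-step lookahead) gives exactly V* back, i.e., deviating at the first step adds no value:

```python
import numpy as np

# Hypothetical toy MDP for illustration (not from the notes).
# P[a][s][s'] = transition probability, R[a][s] = immediate reward r(s, a).
P = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.1, 0.9], [0.6, 0.4]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])
gamma = 0.9

# Value iteration: V converges to the optimal value function V*.
V = np.zeros(2)
for _ in range(1000):
    Q = R + gamma * P @ V        # Q[a][s] = r(s, a) + gamma * E[V(s')]
    V_new = Q.max(axis=0)        # optimize over the first action only
    if np.max(np.abs(V_new - V)) < 1e-12:
        break
    V = V_new

# At the fixed point, the best one-step deviation followed by the
# stationary optimal policy recovers V* exactly: no extra reward.
Q = R + gamma * P @ V
print(np.allclose(Q.max(axis=0), V))  # True
```

The check at the end is exactly the equality discussed above: max over a of Q*(s, a) equals V*(s), so a policy that is free to deviate on its first step cannot do better than the best stationary policy.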

Amir

I think I am missing a central point in this whole Q, V functions idea: why does deviating from the policy in the first step not add power (increase reward)?