In question 2 of the homework, we have:
> Prove that for MDPs, stochastic policies are not better than deterministic ones. Namely, show that if a stochastic policy $\pi^*(a|s)$ obtains an optimal value function $V^*(s)$, then there is a deterministic policy $\hat\pi(a)$ that achieves the same value.
- Is this a typo? I'm pretty sure $\hat\pi$ should be a function of $s$, not of $a$.
- Regarding notation: $\pi^*(a|s)$ is the probability of taking action $a$ in state $s$, whereas $\hat\pi(s)$ is the action picked by the deterministic policy in state $s$?
- If I understand correctly, Propositions 2.4 and 2.6 (from the scribe notes) almost solve this question entirely; see the sketch below. Am I missing anything, or is it really that simple?
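For concreteness, here is the argument I have in mind (just a sketch, assuming the propositions give the Bellman consistency equation for $V^{\pi^*}$ and a policy-improvement-style bound; I'm writing $r(s,a)$, $P(s' \mid s,a)$, and $\gamma$ for the reward, transition kernel, and discount, which may differ from the scribes' notation):

$$
\hat\pi(s) \in \arg\max_{a} Q^{*}(s,a),
\qquad
Q^{*}(s,a) = r(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, V^{*}(s').
$$

Since $V^{*}(s) = \sum_{a} \pi^{*}(a \mid s)\, Q^{*}(s,a) \le \max_{a} Q^{*}(s,a) = Q^{*}(s, \hat\pi(s))$, a policy-improvement step should give $V^{\hat\pi}(s) \ge V^{*}(s)$, while $V^{\hat\pi}(s) \le V^{*}(s)$ holds because $V^{*}$ is already optimal.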
Thanks in advance!