Hi,
There were two important steps that were erroneously omitted from the HW instructions:
- First, the rewards should be discounted, i.e., when summing rewards to compute the gradient, the reward that is the k-th summand should be multiplied by gamma^k.
For example, if the rewards to be summed for some step are 1, 1, 1, 1 and gamma is 0.5, you need to sum 1 + 0.5 + 0.25 + 0.125.
For the exercise, use gamma = 0.99.
- Second, the discounted rewards should also be normalized per episode.
That is, after you compute the sums of discounted rewards, do not multiply the log-probabilities' derivatives by them directly; multiply them by a normalized version instead.
Normalization means subtracting the mean and dividing by the standard deviation.
For example, if an episode had 4 steps and its sums of discounted rewards were 2015, 2016, 2017, 2018, then subtracting the mean (2016.5) gives -1.5, -0.5, 0.5, 1.5, and dividing by the standard deviation (~1.118) gives -1.341, -0.447, 0.447, 1.341. A short code sketch of both steps follows this list.
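For reference, here is a minimal NumPy sketch of both steps. The function names (discounted_returns, normalize) and the small eps term in the denominator are my own choices for illustration, not part of the HW skeleton; adapt it to whatever code structure you already have.

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    # Sum of discounted future rewards for each step of one episode:
    # returns[t] = rewards[t] + gamma*rewards[t+1] + gamma^2*rewards[t+2] + ...
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def normalize(returns, eps=1e-8):
    # Per-episode normalization: subtract the mean, divide by the standard deviation.
    # (eps only guards against division by zero; it is not required by the HW.)
    return (returns - returns.mean()) / (returns.std() + eps)

# The examples from above:
print(discounted_returns([1, 1, 1, 1], gamma=0.5))
# -> [1.875 1.75  1.5   1.   ]  (the first entry is 1 + 0.5 + 0.25 + 0.125)
print(normalize(np.array([2015.0, 2016.0, 2017.0, 2018.0])))
# -> approximately [-1.341 -0.447  0.447  1.341]
```

Note that the worked example uses the population standard deviation, which is also NumPy's default and is what gives the ~1.118 above.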
We apologize for any confusion this may have caused.
The HW submission deadline is extended to 2/7/17 so you can catch up with this update.
A corrected HW instructions PDF will also be uploaded soon.
Have a nice weekend,
Daniel