thanks!

For a given episode, the identity shown in class says that we take the suffixes of the sums of discounted rewards and multiply them by the corresponding log-probability gradients.

Let's call these suffixes reward signals.

So for every episode, if we had T actions during the episode, we have T reward signals.

As said before, the method described in class learns from these reward signals by multiplying them by the corresponding log-probability gradients.

The updated instructions differ from the previous ones in that you're asked to use "normalized" reward signals instead of the "original" reward signals.

As described above, the normalization is done by subtracting their mean, and then dividing by the standard deviation.

For example, if an episode had 4 steps and its *sums* of discounted rewards were 2015, 2016, 2017, 2018, then subtracting the mean (2016.5) gives -1.5, -0.5, 0.5, 1.5, and dividing by the standard deviation (~1.118) gives -1.341, -0.447, 0.447, 1.341.

How exactly can we normalize the sums if what we have is individual rewards? Do we subtract the mean and divide by the standard deviation for each cumulative sum?

Do we divide the cumulative sum from time t to the end by \gamma^t so that it is discounted relative to the current step rather than the start of the episode, or do we keep it discounted as is?

And there are quite a few other questions that are unclear and would easily be answered if the instructions with the formula were posted.

I have now wasted yet another weekend on missing instructions, running experiments based on instructions that were never posted, since the file on the site is still the old version…

The instructions on the website are not updated (the word "discount" does not even appear in the document), and it seems the old version is there instead of the new one.

Hence I can't work on the homework (sorry, I can't quite understand the normalization from the description here).

Any chance of fixing that?

Thanks in advance,

~ Barak

- Reducing the variance of your estimate

- Scaling the step-size in a way that helps the optimization process

Thanks!

Since we are using partial cumulative rewards,

i.e. the log-probability gradient at step t is multiplied by R_t + … + R_T and not by R_1 + … + R_T,

Wouldn't it make sense for R_t to be multiplied by gamma^1 instead of gamma^t, and so on?

This way every action would be rewarded with 0.99 times the immediate reward it achieved.

Thanks!

There were two important steps that were erroneously omitted from the HW instructions:

- First of all, the rewards should be discounted, i.e., when summing rewards to compute the gradient, the reward that is the k-th summand (counting from k = 0) should be multiplied by gamma^k.

For example, if the rewards to be summed for some step are 1, 1, 1, 1, and gamma was 0.5, you'll need to sum 1, 0.5, 0.25, 0.125.

For the exercise, use gamma = 0.99.
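A minimal sketch of this discounting step (the helper name and code are mine, not part of the official instructions):

```python
def discounted_sum(rewards, gamma):
    """Sum rewards, weighting the k-th summand (k starting at 0) by gamma**k."""
    return sum(r * gamma ** k for k, r in enumerate(rewards))

# The example above: rewards 1, 1, 1, 1 with gamma = 0.5
# becomes 1 + 0.5 + 0.25 + 0.125 = 1.875.
print(discounted_sum([1, 1, 1, 1], 0.5))  # 1.875
```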

- Second of all, the discounted rewards should also be normalized per episode.

This means that after you compute the sums of discounted rewards, instead of multiplying the log-probabilities' derivatives by those sums directly, you multiply them by a normalized version of the sums.

The normalization is done by subtracting the mean, and dividing by the standard deviation.

So for example, if an episode had 4 steps and its sums of discounted rewards were 2015, 2016, 2017, 2018, then subtracting the mean (2016.5) gives -1.5, -0.5, 0.5, 1.5, and dividing by the standard deviation (~1.118) gives -1.341, -0.447, 0.447, 1.341.
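As a sketch, this normalization can be checked with Python's standard library; note it uses the *population* standard deviation, which matches the ~1.118 in the example (the helper name is mine):

```python
import statistics

def normalize(sums):
    """Subtract the mean, then divide by the population standard deviation."""
    mean = statistics.fmean(sums)
    std = statistics.pstdev(sums)
    return [(s - mean) / std for s in sums]

sums = [2015, 2016, 2017, 2018]  # sums of discounted rewards per step
print([round(x, 3) for x in normalize(sums)])
# [-1.342, -0.447, 0.447, 1.342] (approx. the -1.341, ..., 1.341 in the example)
```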

We apologize if this caused any confusion for you.

The HW submission deadline is extended to 2/7/17 so that you can catch up with this update.

A corrected HW instruction PDF will also be uploaded soon.

Have a nice weekend,

Daniel