I uploaded a table with the grades for HW 1-4.

For HW4, some students didn't seem to submit the theory questions / the programming assignment.

If you believe you submitted the theory questions / programming assignment but the table says otherwise,

please mail me at:

li.ca.uat.liam|adnomrac

Thanks,

Daniel

In order to reach the required test level, I made my training algorithm save candidate parameters and then ran a second process to choose the best ones.

After I chose the best params, I saved them as a 'ws.p' file as required.

So my lunar_lander.py code saves a lot of files, not a single 'ws.p' file. Is that ok?

I don't want to change it, since if I change it to save only one 'ws.p' file, it will overwrite the best params I chose from the candidates.
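For what it's worth, a second pass like this can end with a single 'ws.p' without overwriting anything during training. A minimal sketch, where the candidate file names and the evaluate() scoring are hypothetical placeholders, not the actual HW code:

```python
import glob
import pickle


def evaluate(params):
    # Placeholder for the real policy evaluation (e.g. average episode
    # reward over a few rollouts); here it just reads a stored score.
    return params["score"]


# Create some dummy candidate files so the sketch is self-contained.
for i, s in enumerate([120.0, 195.5, 180.2]):
    with open("candidate_{}.p".format(i), "wb") as f:
        pickle.dump({"score": s, "weights": [i]}, f)

# Second pass: load every candidate, keep the best, and write only the
# winner to 'ws.p' as the submission guidelines require.
best_params, best_score = None, float("-inf")
for path in sorted(glob.glob("candidate_*.p")):
    with open(path, "rb") as f:
        params = pickle.load(f)
    score = evaluate(params)
    if score > best_score:
        best_params, best_score = params, score

with open("ws.p", "wb") as f:
    pickle.dump(best_params, f)
```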

Thanks,

Alon

I can't submit my HW via the Google Form, and I know this problem occurs for others as well.

Is there any other way to submit?

I want to submit my code, but it seems that I can't import TensorFlow on nova. I opened Python, tried to import it, and got an error:

ImportError: libcudart.so.8.0: cannot open shared object file: No such file or directory

Does anyone else have this problem? Do I need to install TensorFlow and Gym on nova or something?

If not, how do I fix this? I probably need to make some modifications to my code for it to work with Python 2.7, but I can't as long as this problem is occurring.

I've noticed that when I calculate the gradient with -loss I get increasing results, but when I calculate with +loss I get decreasing results.

I guessed that's because -loss means "whatever you just did, don't do it!" and +loss means the opposite.

My question is: how do I know whether to calculate the gradients with -loss or +loss, assuming (-) won't always mean "no" and (+) won't always mean "yes"?

Just to be clear, in case I used the wrong term: by "loss" I mean equation no. 4 in the HW.
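The usual reason for the sign is that optimizers minimize, while policy gradient wants to maximize expected reward, so the objective is negated before being handed to the optimizer. A minimal NumPy sketch with hypothetical names, not necessarily the HW's equation 4:

```python
import numpy as np


def pg_loss(log_probs, rewards):
    # Policy gradient ascends on sum_t log pi(a_t|s_t) * R_t.
    # Optimizers descend, so we hand them the negated objective:
    # descending on -objective is the same as ascending on the objective.
    return -np.sum(np.asarray(log_probs) * np.asarray(rewards))
```

With this convention, the framework's minimizer increases the log-probability of actions that were followed by positive rewards.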

]]>For the programming assignment, please submit your work via the following updated guidelines:

- You need to submit a Pickle file of your network's parameters.

The file should store an array in the following format: [W1, b1, W2, b2, W3, b3]. A tester file is now uploaded so you can use it to check that your params are in the right format. You might need to transpose your matrices etc. to be compatible with the tester, so please check that.

- Enter the path to your code and pickle file's directory in the following updated Google Form: https://tinyurl.com/ybskbrq4

You might need to change your access permissions so that your hw can be accessed, so please check that as well.
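Producing the parameter file in the required format can look like this sketch; the layer sizes here are hypothetical, so check your shapes against the tester:

```python
import pickle

import numpy as np

# Hypothetical layer sizes for illustration only.
W1, b1 = np.random.randn(8, 16), np.zeros(16)
W2, b2 = np.random.randn(16, 16), np.zeros(16)
W3, b3 = np.random.randn(16, 4), np.zeros(4)

# Store the array in the required [W1, b1, W2, b2, W3, b3] format.
with open("ws.p", "wb") as f:
    pickle.dump([W1, b1, W2, b2, W3, b3], f)
```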

Thanks,

Daniel

Is it ok to print the average reward after every batch? And generally to print progress messages?

Also, is it ok to stop at an average reward above 200, since that means the task has been solved?

This could save a lot of excess episodes.

Thanks!

Once I switch to xavier_initializer it converges. Is this to be expected? Or does it point to a problem in my nn/optimizer? I'm using 2 layers of 10 neurons and Adam.

Thanks!

Seems to work without it…

Did anyone else tackle this issue?

There were two important steps that were erroneously omitted from the HW instructions:

- First of all, the rewards should be discounted, i.e. when summing rewards for computing the gradient, the reward which is the k-th summand should be multiplied by gamma^k.

For example, if the rewards to be summed for some step are 1, 1, 1, 1, and gamma was 0.5, you'll need to sum 1, 0.5, 0.25, 0.125.

For the exercise, use gamma = 0.99.

- Second of all, your discounted rewards should also be normalized per episode.

This means that after you compute the sums of discounted rewards, instead of multiplying the log-probabilities' derivatives by them, multiply them by a normalized version of them.

The normalization is done by subtracting the mean, and dividing by the standard deviation.

So for example, if an episode had 4 steps and its sums of discounted rewards were 2015, 2016, 2017, 2018, then subtracting the mean (2016.5) gives -1.5, -0.5, 0.5, 1.5, and dividing by the standard deviation (~1.118) gives -1.342, -0.447, 0.447, 1.342.
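The two steps can be sketched in NumPy, with gamma and the per-episode normalization exactly as described:

```python
import numpy as np


def discount_and_normalize(rewards, gamma=0.99):
    # Sum of discounted future rewards for each step:
    # G_t = r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ...
    discounted = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        discounted[t] = running
    # Normalize per episode: subtract the mean, divide by the std.
    return (discounted - discounted.mean()) / discounted.std()
```

With gamma = 0.5 and rewards 1, 1, 1, 1, the first discounted sum is 1 + 0.5 + 0.25 + 0.125 = 1.875, matching the first example.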

We apologize if this caused any confusion for you.

The HW submission date is extended to 2/7/17 so you can catch up with this update.

A corrected HW instruction pdf will also be uploaded soon.

Have a nice weekend,

Daniel

For some of the runs I got really good results, but for other runs I can't get to the maximum results.

That may make sense, since the chance of getting the best results depends on the starting configuration, which changes every run.

But in some of the runs I get to the maximum results, then after a while drop gradually to zero without ever rising again. Does that make sense? Why can this happen?

I implemented the game as asked in the exercise:

15 hidden, batches of 10, learning_rate 0.01.

After 165,000 games played (meaning 16,500 gradient updates) it does not converge at all.

The CartPole, however, did converge after 5,000 games played.

Do you know how long it should take? (I am way past 30K and rewards are still about -180.)

What learning rate is preferred?

In the last layer of our NN we were told to use softmax, but we activate it on a single float (a number).

According to the softmax definition in TensorFlow, it makes sense that it will return 1 every time if we feed it a single float:

softmax = exp(logits) / reduce_sum(exp(logits), dim)

But if we get 1 all the time, we will never be able to progress.
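The collapse is easy to reproduce from the definition above in plain NumPy: with a single logit the numerator equals the denominator, so the output is always exactly 1.

```python
import numpy as np


def softmax(logits):
    e = np.exp(logits - np.max(logits))  # subtract max for numerical stability
    return e / e.sum()


print(softmax([3.7]))       # a single logit always yields [1.]
print(softmax([1.0, 2.0]))  # one logit per action gives a real distribution
```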

Can someone please help me solve this problem?

When using tf.nn.softmax in my agent and np.random.multinomial on the result, once in maybe 10,000 episodes I get this error:

in mtrand.RandomState.multinomial (numpy/random/mtrand/mtrand.c:37769)

ValueError: sum(pvals[:-1]) > 1.0

I guess it's because the result of tf.nn.softmax in that case doesn't sum to exactly 1 for some reason.

Is there any way to fix this issue?

Would a try/except be the best option? Or should I normalize again just in case (or after checking whether it sums to 1)?
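One common workaround, sketched here as a suggestion rather than the official answer: cast the probabilities to float64 and renormalize before sampling, since a float32 softmax output can sum to slightly more than 1 and trigger exactly that ValueError.

```python
import numpy as np


def sample_action(probs):
    # Cast to float64 and renormalize so the vector sums to 1 up to
    # float64 precision, avoiding "ValueError: sum(pvals[:-1]) > 1.0".
    p = np.asarray(probs, dtype=np.float64)
    p /= p.sum()
    return int(np.argmax(np.random.multinomial(1, p)))
```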

thanks!

I've run our lunar lander learner many times and it generally seems to work, but there are a lot of fluctuations: sometimes it learns fast and even reaches a 200 average reward, sometimes it gets to about 100 and then deteriorates quickly, and sometimes it doesn't improve at all.

I've tried changing the batch size and learning rate, but it doesn't help stabilize the outcome.

Does this make sense, or is there probably something wrong with the code?

And on a similar note, when submitting, does an average reward of 180 mean the average across a batch?

And should our code reach 180 on every run? Or should we just output the weights when it does?

Thanks in advance!

*Prove that for MDPs, stochastic policies are not better than deterministic ones. Namely, show that if a stochastic policy $\pi^*(a|s)$ obtains an optimal value function $V^*(s)$, then there is a deterministic policy $\hat\pi(a)$ that achieves the same value*

- Are these typos? I'm pretty sure $\hat\pi$ should be defined over $s$ and not $a$
- Regarding notation: $\pi^*(a|s)$ is the probability of an action given a state, whereas $\hat\pi(s)$ is the action picked with a deterministic policy?
- If I understand correctly, proposition 2.4 and proposition 2.6 (from the scribes) almost solve this question entirely. Am I missing anything, or is it really that simple?

Thanks in advance!
