If overall you sample N episodes during your training, and your batch size is k, the number of updates you'll have is N/k.

As I understand it, in our case the batch size is the number of episodes after which you update the parameters. Meaning that every "sample" is an episode.

Is that correct? If so, how could you update the parameters during an episode?

If not, could you please explain how a batch is defined in our case?

Thanks!

The network structure should stay as described in the hw, and the loss/reward is given to you by the environment, and is therefore fixed by definition.

Addressing the points you made:

1) The CartPole problem has horizontal symmetry, so it makes sense that the "go right"/"go left" actions shouldn't have any bias.

Including the bias term in the architecture shouldn't hurt, of course, since it can always be learned to be 0.

2) Citing the hw instructions: "You can use any batch size and stopping criteria you want". Bigger batch sizes mean the param updates are more accurate, but they also mean fewer param updates per episode. It's a trade-off, and therefore each problem has its own "sweet spot".
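For intuition (just an illustration, not part of the required implementation): with a batch size of k episodes, the parameters are updated once per k completed episodes, so updates happen between episodes, never during one. A toy sketch with a made-up helper:

```python
# Hypothetical helper (not the hw code): counts how many parameter
# updates you get from N episodes with batch size k.
def count_updates(total_episodes, batch_size):
    updates = 0
    batch = []
    for episode in range(total_episodes):
        batch.append(episode)          # collect one full episode
        if len(batch) == batch_size:   # a "batch" = batch_size episodes
            updates += 1               # one gradient step per batch
            batch = []                 # start collecting the next batch
    return updates

print(count_updates(1000, 10))  # 100 updates
print(count_updates(1000, 5))   # 200 updates: smaller batches, more updates
```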

3) The algorithm as described in the hw should work. If you want to implement further variance reductions, you're of course free to do so.

Just because you're estimating the gradient of the expected reward in a more complicated way than in eq. 4 doesn't mean you're changing the expected reward itself.
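To see this numerically, here's a hedged sketch on a toy two-action bandit I made up (not the CartPole setup): subtracting a constant baseline from the reward changes the variance of the score-function gradient estimate, but not its expectation.

```python
import math
import random

random.seed(0)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

theta = 0.3
p = sigmoid(theta)              # probability of choosing action 1
reward = {0: 1.0, 1: 3.0}       # fixed reward per action (toy setup)

def grad_estimate(baseline, n=200000):
    """Monte Carlo estimate of E[ d/dtheta log pi(a) * (R(a) - baseline) ]."""
    total = 0.0
    for _ in range(n):
        a = 1 if random.random() < p else 0
        # d/dtheta log pi(a) for a Bernoulli(sigmoid(theta)) policy:
        score = (1 - p) if a == 1 else -p
        total += score * (reward[a] - baseline)
    return total / n

g_plain = grad_estimate(baseline=0.0)
g_base = grad_estimate(baseline=2.0)  # any constant baseline works
# Both should be close to the true gradient p*(1-p)*(r1 - r0):
print(g_plain, g_base)
```

The baseline term drops out in expectation because E[d/dtheta log pi(a)] = 0, which is exactly why baselines/advantages reduce variance without biasing the gradient.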

You can also estimate the number pi using different algorithms; it's still the same number being estimated (:
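To make the pi analogy concrete, a quick sketch with two unrelated estimators of the same number:

```python
import math
import random

random.seed(1)

def pi_monte_carlo(n=100000):
    """Estimator 1: 4 * (fraction of random points inside the unit circle)."""
    inside = sum(1 for _ in range(n)
                 if random.random() ** 2 + random.random() ** 2 <= 1.0)
    return 4.0 * inside / n

def pi_leibniz(terms=100000):
    """Estimator 2: Leibniz series, pi/4 = 1 - 1/3 + 1/5 - ..."""
    return 4.0 * sum((-1) ** k / (2 * k + 1) for k in range(terms))

print(pi_monte_carlo())  # approx 3.14
print(pi_leibniz())      # approx 3.14159
```

Different variance, different convergence rates, same target, which is exactly the situation with different estimators of the policy gradient.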

Did this help to answer your question?

Daniel

So my question still holds - would advantage vectors and other similar techniques be OK to use? As the OpenAI Gym submissions indicate, these really help.

Are we required to stick "exactly" to the instructions as given in the homework when implementing the network structure and loss function?

I'm asking since when training the CartPole problem as a baseline, I encountered multiple phenomena:

- Removing the bias in the last layer (the expression going to the softmax) makes the network train much better on CartPole
- Using smaller batch sizes (e.g. 5 episodes instead of 10) really helped it to converge on CartPole (yes, we did normalize the gradients to the average contribution of all samples in the batch)
- The internet suggests using "advantages" to improve the training, and indeed many submissions to the OpenAI Gym do use it
- etc.

So, the question essentially is - which parts are we allowed to play with? Which parts should we keep fixed?

Thanks in advance,

~ Barak