Hi,
Are we required to stick "exactly" to the instructions given in the homework when implementing the network structure and the loss function?
I'm asking because, while training on CartPole as a baseline, I ran into several phenomena:
- Removing the bias from the last layer (the pre-softmax logits) makes the network train much better on CartPole
- Using smaller batch sizes (e.g. 5 episodes instead of 10) really helped convergence on CartPole (yes, we did normalize the gradient by averaging over all samples in the batch)
- The internet suggests using "advantages" (subtracting a baseline from the returns) to improve training, and many OpenAI Gym submissions indeed use them (see the sketch after this list)
- etc.
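To make the advantage point concrete, here is a minimal sketch of what I have in mind: a REINFORCE-style loss where the batch-mean return is subtracted as a baseline. The network shape, hidden size, and the normalization by the standard deviation are just my own illustration, not what the homework specifies.

```python
# Minimal sketch (my own illustration, not the homework spec): a REINFORCE-style
# loss with a simple "advantage" = return minus the batch-mean return.
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    def __init__(self, obs_dim=4, n_actions=2, hidden=32):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        # bias=False is the variant mentioned in the first bullet above
        self.logits = nn.Linear(hidden, n_actions, bias=False)

    def forward(self, obs):
        return self.logits(self.body(obs))

def reinforce_loss(policy, observations, actions, returns):
    """Policy-gradient loss averaged over all samples in the batch.

    observations: (N, obs_dim) float tensor
    actions:      (N,) long tensor of chosen actions
    returns:      (N,) float tensor of discounted returns-to-go
    """
    log_probs = torch.log_softmax(policy(observations), dim=-1)
    chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    # "Advantage": center (and scale) the returns so better-than-average
    # trajectories push their actions' probabilities up and worse ones push them down.
    advantages = (returns - returns.mean()) / (returns.std() + 1e-8)
    # Negative sign because we minimize the loss but want to maximize return.
    return -(chosen * advantages).mean()
```

Is this the kind of "advantage" that would be acceptable, or does the homework expect only the plain returns (or, conversely, allow a learned baseline / actor-critic estimate)?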
So the question, essentially, is: which parts are we allowed to play with, and which parts should we keep fixed?
Thanks in advance,
~ Barak