Hi
I've run our lunar lander learner many times, and it generally seems to be working, but the results fluctuate a lot: sometimes it learns fast and even reaches an average reward of 200, sometimes it gets to about 100 and then deteriorates quickly, and sometimes it doesn't improve at all.
I've tried changing the batch size and learning rate, but that doesn't help stabilize the outcome.
Does this make sense, or is there probably something wrong with the code?
And on a similar note, when submitting, does an average reward of 180 mean the average across a batch?
And should our code reach 180 on every run? Or should we just output the weights when it does?
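To make the second question concrete, here is a minimal sketch of the other reading I can think of: averaging over the most recent episodes rather than over a batch, and treating the first time that average reaches 180 as the point to output the weights. The 100-episode window and the function name `first_solved_episode` are my own assumptions, not something from the assignment.

```python
from collections import deque


def first_solved_episode(episode_rewards, threshold=180.0, window=100):
    """Return the 1-based episode at which the mean total reward of the
    last `window` episodes first reaches `threshold`, or None if it never does.

    Both `threshold` and `window` are assumed values for illustration.
    """
    recent = deque(maxlen=window)  # keeps only the newest `window` rewards
    for episode, reward in enumerate(episode_rewards, start=1):
        recent.append(reward)
        # Only check once a full window of episodes has been collected.
        if len(recent) == window and sum(recent) / window >= threshold:
            return episode  # this is where the weights would be saved
    return None


if __name__ == "__main__":
    # Toy reward history: 50 failed episodes, then 100 good ones.
    history = [0.0] * 50 + [200.0] * 100
    print(first_solved_episode(history))
```

If the intended meaning is instead the average within one training batch, the same check would apply with the batch's episode rewards in place of the sliding window.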
Thanks in advance!