The expected value is taken over the randomness of the environment (in our case the lunar terrain, the location of the spaceship relative to the landing pad, and so on) and the randomness of the policy. ]]>

Given parameters $\theta$, do we simulate the environment for T steps (using the same parameters throughout) and then calculate the gradient based on those results?

If so, what is T? Is it the number of steps until the simulation returns done?

What are we computing the expected value over? (What are the random variables?)
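
To make the question concrete, here is a minimal sketch of what such a procedure could look like, assuming a REINFORCE-style (score-function) gradient estimate. Everything in it is a stand-in I made up for illustration: the toy one-dimensional environment replaces the lunar lander, and `policy_prob_right`, `rollout`, `estimate_gradient` and the `max_steps` cap are hypothetical names, not part of any library. In this sketch $\theta$ is held fixed during a rollout, T is simply the number of steps taken before the episode reports done, and the expectation is approximated by averaging over many episodes.

```python
import math
import random


def policy_prob_right(theta, state):
    """Probability of action 1 under a logistic policy (assumed form)."""
    return 1.0 / (1.0 + math.exp(-theta * state))


def rollout(theta, rng, max_steps=50):
    """Run one episode with theta fixed; return (sum of grad log pi, return, T)."""
    state = rng.uniform(-1.0, 1.0)  # environment randomness: random start position
    grad_log_prob = 0.0
    total_reward = 0.0
    T = 0
    done = False
    while not done:
        p = policy_prob_right(theta, state)
        action = 1 if rng.random() < p else 0  # policy randomness: sampled action
        # For a logistic policy, d/dtheta log pi(a|s) = (a - p) * s.
        grad_log_prob += (action - p) * state
        # Toy dynamics (made up): the action pushes left or right, plus noise.
        state += (0.1 if action == 1 else -0.1) + rng.gauss(0.0, 0.05)
        total_reward += -abs(state)  # reward for staying near the "pad" at 0
        T += 1
        done = abs(state) < 0.05 or T >= max_steps
    return grad_log_prob, total_reward, T


def estimate_gradient(theta, num_episodes=1000, seed=0):
    """Monte Carlo estimate of the gradient of the expected return."""
    rng = random.Random(seed)
    samples, lengths = [], []
    for _ in range(num_episodes):
        g, ret, T = rollout(theta, rng)
        samples.append(g * ret)  # one sample of the score-function estimator
        lengths.append(T)
    return sum(samples) / num_episodes, lengths


if __name__ == "__main__":
    grad, lengths = estimate_gradient(theta=0.0)
    print("gradient estimate:", grad)
    print("shortest / longest episode:", min(lengths), max(lengths))
```

Note that T differs from episode to episode here, so it is itself a random quantity. The random variables the average runs over are the start state, the transition noise, and the sampled actions.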

