I think it needs to be np.random.multinomial instead of np.multinomial.

I had the same problem. It's a numerical precision issue.

You need to cast your array to float64 (the softmax output is probably float32) and then renormalize it so that it sums to 1.
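For what it's worth, here is a minimal sketch of that fix. The helper name `to_valid_pvals` and the example probabilities are made up for illustration; `action_probs` stands in for whatever your softmax returns.

```python
import numpy as np

def to_valid_pvals(probs):
    """Cast to float64 first, then renormalize so the entries sum to 1."""
    p = np.asarray(probs, dtype=np.float64)
    return p / p.sum()

# Hypothetical softmax output stored as float32.
action_probs = np.array([0.2, 0.3, 0.5], dtype=np.float32)

pvals = to_valid_pvals(action_probs)
sample = np.random.multinomial(1, pvals)  # one-hot vector with a single 1
action = int(np.argmax(sample))           # index of the sampled action
```

The order matters: normalizing in float32 and casting afterwards can reintroduce the rounding error, so the division is done after the cast.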

Searching for this on Google, I found that there's an open issue about it:

github[dot]com/numpy/numpy/issues/8317

The function converts its input (in our case the softmax output) from float32 to float64, which can cause the sum of the vector elements to be slightly greater than 1.

This is a problem since multinomial assumes that the input vector is a valid distribution.
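If you want to see whether your vector is affected before sampling, something like the check below may help. It is only a sketch: `is_valid_distribution` and its tolerance are my own assumptions, not NumPy's internal check.

```python
import numpy as np

def is_valid_distribution(probs, tol=1e-12):
    """Hypothetical helper: non-negative entries whose float64 sum is 1 within tol."""
    p = np.asarray(probs, dtype=np.float64)
    return bool(np.all(p >= 0.0) and abs(p.sum() - 1.0) <= tol)

# float32 cannot represent 0.2 or 0.3 exactly; the stored values are slightly
# larger, so the sum exceeds 1 once the array is widened to float64.
probs32 = np.array([0.2, 0.3, 0.5], dtype=np.float32)
ok_before = is_valid_distribution(probs32)

# Casting to float64 and renormalizing restores a valid distribution.
probs64 = probs32.astype(np.float64)
probs64 /= probs64.sum()
ok_after = is_valid_distribution(probs64)
```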

The old skeleton contained the line: "action = np.argmax(action_probs)", which is wrong: argmax always picks the most probable action, so the policy doesn't really sample anything.

The skeleton is now updated to have the line: "action = np.argmax(np.multinomial(1,action_probs))".
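To make the difference between the two lines concrete, here is a small sketch comparing greedy selection with sampling. The probabilities and function names are made up for illustration, and it calls np.random.multinomial, since NumPy has no top-level np.multinomial.

```python
import numpy as np

np.random.seed(0)  # only to make the illustration repeatable

# Hypothetical policy output (already float64 and summing to 1).
action_probs = np.array([0.1, 0.2, 0.7])

def greedy_action(probs):
    """Old behaviour: always returns the most probable action."""
    return int(np.argmax(probs))

def sampled_action(probs):
    """New behaviour: draw a one-hot sample, then read off its index."""
    one_hot = np.random.multinomial(1, probs)
    return int(np.argmax(one_hot))

greedy = [greedy_action(action_probs) for _ in range(1000)]
sampled = [sampled_action(action_probs) for _ in range(1000)]

# Number of times each action was drawn; roughly proportional to action_probs.
counts = np.bincount(sampled, minlength=len(action_probs))
```

The greedy list contains only action 2, while the sampled list contains all three actions in roughly the proportions given by the policy.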