Type of reinforcement learning (RL) algorithm that optimizes a parameterized policy directly by estimating the gradient of the expected return with respect to the policy parameters and updating them by gradient ascent.
Generality: 805
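
As an illustration of the definition above, the following is a minimal sketch of a REINFORCE-style update on a two-armed bandit, one of the simplest settings where a policy gradient can be computed by hand. The softmax policy, the running-average baseline, the reward noise, and every name and hyperparameter here (`train_bandit`, `arm_means`, `lr`, `steps`) are assumptions chosen for illustration, not part of the entry itself.

```python
import math
import random


def softmax(prefs):
    """Turn action preferences into a probability distribution."""
    m = max(prefs)  # subtract the max for numerical stability
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def train_bandit(arm_means, steps=2000, lr=0.1, seed=0):
    """REINFORCE-style training of a softmax policy on a bandit (illustrative)."""
    rng = random.Random(seed)
    theta = [0.0] * len(arm_means)  # policy parameters: one preference per arm
    baseline = 0.0                  # running average reward, used to reduce variance
    for t in range(1, steps + 1):
        probs = softmax(theta)
        action = rng.choices(range(len(theta)), weights=probs)[0]
        # Hypothetical reward model: arm mean plus Gaussian noise.
        reward = arm_means[action] + rng.gauss(0.0, 0.1)
        baseline += (reward - baseline) / t
        advantage = reward - baseline
        for a in range(len(theta)):
            # For a softmax policy, d/d theta_a of log pi(action)
            # equals 1[a == action] - pi(a).
            grad_log = (1.0 if a == action else 0.0) - probs[a]
            # Gradient ascent on expected reward.
            theta[a] += lr * advantage * grad_log
    return softmax(theta)


final_probs = train_bandit([0.2, 0.8])
print(final_probs)
```

Under these assumptions the policy should shift most of its probability toward the arm with the higher mean reward. Practical policy gradient methods apply the same idea to multi-step trajectories, using returns or learned value estimates in place of the single-step reward.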