"Stochastic Direct Reinforcement: Policies with Recurrence"
A paradigm shift is underway in reinforcement learning (RL). The dominant approach to RL over the past 20 years has been based on dynamic programming, whereby RL agents learn an abstract value function (VF). An alternative approach, direct reinforcement (DR), has recently been revisited, wherein DR agents learn strategies to solve problems directly. DR can enable a simpler problem representation, avoid Bellman's curse of dimensionality, and offer compelling advantages in efficiency.
This talk will begin with a short introduction to RL. I will briefly trace its history and origins, distinguish RL from other statistical learning approaches, and contrast value function methods with direct reinforcement.
I will then present a new algorithm called Stochastic Direct Reinforcement (SDR). This policy gradient algorithm is formulated for uncertain environments, partially observed states, and non-Markovian policies. Because SDR agents represent policies directly, they can naturally incorporate the recurrent structure that is intrinsic to many potential applications, and can thereby better solve the temporal credit assignment problem.
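To make the idea of a recurrent policy with a policy gradient concrete, here is a minimal sketch. It is not the SDR algorithm itself: it is a generic score-function (REINFORCE-style) estimate, and the logistic policy form, the parameter names `w`, `u`, `b`, and the helper names are all assumptions chosen for illustration. The policy is recurrent only in the sense that the probability of the current action depends on the agent's own previous action.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def log_prob_and_grad(theta, observations, actions, a_init=0):
    """Log-probability of a binary action sequence and its gradient.

    Hypothetical recurrent policy (an assumption, not the SDR formulation):
        P(a_t = 1 | x_t, a_{t-1}) = sigmoid(w * x_t + u * a_{t-1} + b)
    with theta = (w, u, b).
    """
    w, u, b = theta
    logp = 0.0
    grad = [0.0, 0.0, 0.0]
    prev = a_init
    for x, a in zip(observations, actions):
        p = sigmoid(w * x + u * prev + b)
        logp += math.log(p) if a == 1 else math.log(1.0 - p)
        err = a - p  # derivative of the log-probability w.r.t. the logit
        grad[0] += err * x     # d/dw
        grad[1] += err * prev  # d/du (the recurrent weight)
        grad[2] += err         # d/db
        prev = a  # the chosen action feeds back into the next decision
    return logp, grad


def score_function_estimate(theta, observations, actions, episode_return):
    """Single-episode estimate of the gradient of expected return:
    episode_return * grad log P(actions | observations; theta)."""
    _, grad = log_prob_and_grad(theta, observations, actions)
    return [episode_return * g for g in grad]


if __name__ == "__main__":
    theta = (0.3, -0.7, 0.1)
    obs = [1.0, -0.5, 2.0, 0.0]
    acts = [1, 0, 0, 1]
    print(score_function_estimate(theta, obs, acts, episode_return=2.0))
```

Averaging such estimates over many sampled episodes and stepping `theta` in that direction is the basic pattern that policy gradient methods share; how SDR itself handles recurrence and credit assignment is the subject of the talk.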
Demonstrations of DR include repeated games and trading in financial markets. We show that SDR agents can learn winning strategies in simple competitive games. For the Iterated Prisoner's Dilemma, we find that non-recurrent SDR agents learn only the Nash equilibrium strategy of defection, while recurrent SDR agents can learn the globally Pareto-optimal strategy of cooperation.
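The gap between defection and cooperation in the Iterated Prisoner's Dilemma can be illustrated with a small simulation. This is only a sketch with hand-coded strategies, not learned SDR agents, and the payoff values (5, 3, 1, 0) are the conventional textbook choice, assumed here rather than taken from the talk. It shows why memory matters: a strategy that ignores history is pushed toward defection, whereas a strategy that conditions on the previous round (tit-for-tat here) can sustain mutual cooperation and a higher total payoff.

```python
# Payoff to the first player: (my_move, opponent_move) -> reward.
# "C" = cooperate, "D" = defect. Values are an assumed, conventional choice.
PAYOFF = {
    ("C", "C"): 3,  # mutual cooperation
    ("C", "D"): 0,  # I cooperate, opponent defects
    ("D", "C"): 5,  # I defect, opponent cooperates
    ("D", "D"): 1,  # mutual defection
}


def always_defect(my_history, opp_history):
    """Memoryless strategy: ignores all history."""
    return "D"


def tit_for_tat(my_history, opp_history):
    """Memory-one strategy: cooperate first, then copy the opponent's last move."""
    return "C" if not opp_history else opp_history[-1]


def play(strategy_a, strategy_b, rounds):
    """Play an iterated game and return the total payoff of each player."""
    hist_a, hist_b = [], []
    score_a = score_b = 0
    for _ in range(rounds):
        a = strategy_a(hist_a, hist_b)
        b = strategy_b(hist_b, hist_a)
        score_a += PAYOFF[(a, b)]
        score_b += PAYOFF[(b, a)]
        hist_a.append(a)
        hist_b.append(b)
    return score_a, score_b


if __name__ == "__main__":
    # In a single round, defecting pays more whatever the opponent does.
    assert PAYOFF[("D", "C")] > PAYOFF[("C", "C")]
    assert PAYOFF[("D", "D")] > PAYOFF[("C", "D")]

    print(play(always_defect, always_defect, 10))  # (10, 10)
    print(play(tit_for_tat, tit_for_tat, 10))      # (30, 30)
    print(play(always_defect, tit_for_tat, 10))    # (14, 9)
```

Against a history-dependent opponent, always defecting earns 14 over ten rounds, while mutual cooperation earns 30 each, so an agent that can condition on the past has something better than defection to learn.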
Time permitting, I will give a brief preview of one or two financial results from my Wednesday Neyman seminar, "Learning to Trade via Direct Reinforcement".