Going beyond average for reinforcement learning
Consider the commuter who toils backwards and forwards each day on a train. Most mornings, her train runs on time and she reaches her first meeting relaxed and ready. But she knows that once in a while the unexpected happens: a mechanical problem, a signal failure, or even just a particularly rainy day. Invariably these hiccups disrupt her pattern, leaving her late and flustered.

Randomness is something we encounter every day, and it has a profound effect on how we experience the world. The same is true in reinforcement learning (RL) applications, systems that learn by trial and error and are motivated by rewards. Typically, an RL algorithm predicts the average reward it receives from multiple attempts at a task and uses this prediction to decide how to act. But random perturbations in the environment can alter its behaviour by changing the exact amount of reward the system receives.

In a new paper, we show it is possible to model not only the average but also the full variation of this reward, which we call the value distribution.
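To make the distinction concrete, here is a minimal sketch (with illustrative numbers, not the method from the paper) comparing two actions whose average returns are identical but whose value distributions differ sharply. An agent that only predicts the mean cannot tell them apart, while the full distribution exposes the risk of a large loss.

```python
# A minimal sketch contrasting a scalar average-return estimate with a
# categorical value distribution over a fixed set of return "atoms".
# All names and numbers below are illustrative assumptions.
import numpy as np

# Possible returns (support atoms) for the commuter's journey, in minutes gained.
atoms = np.array([-30.0, 0.0, 10.0])

# Action A: the usual train -- mostly on time, occasionally very late.
p_train = np.array([0.1, 0.2, 0.7])
mean_train = np.dot(p_train, atoms)   # 4.0 minutes on average

# Action B: a slower but reliable bus with the same long-run average.
p_bus = np.array([0.0, 0.6, 0.4])
mean_bus = np.dot(p_bus, atoms)       # also 4.0 minutes on average

print(mean_train, mean_bus)           # an average-only agent sees no difference

# The full distributions reveal that the train carries a 10% chance of a
# 30-minute loss, while the bus never does.
var_train = np.dot(p_train, (atoms - mean_train) ** 2)
var_bus = np.dot(p_bus, (atoms - mean_bus) ** 2)
print(var_train, var_bus)             # 144.0 vs 24.0
```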