Classically, RL concerns maximizing the expected return. Many have looked at alternative objectives (e.g. Gilbert & Weng’s 2016 Quantile RL), but the field didn’t take off until approximately 2017, when a series of papers emerged demonstrating that learning the full return distribution, and not just its mean, produced agents that appeared to learn faster and asymptote to higher return.
The Bellman operator, classically defined, aims to reach a self-consistent set of predictions. Let \(Q: \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}\) give the expected return of being in state \(s\) and taking action \(a\). The Bellman operator \(\mathcal{T}\) maps a \(Q\)-function to a \(Q\)-function:
\[\mathcal{T} Q(s,a) = \mathbb{E}[R(s,a)] + \gamma \, \mathbb{E}_{S', A'}[Q(S', A')]\]
where \(S'\) is the next state and \(A'\) the action taken there. The Bellman operator is powerful because it is a contraction, meaning its repeated application converges to a unique fixed point. Bellemare, Dabney and Munos (2017) asked whether it is possible to define a distributional equivalent of the Bellman operator that is also a contraction.
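To make the classical operator concrete, here is a minimal tabular sketch of one backup \(Q \leftarrow \mathcal{T} Q\). The array names (`P`, `R`, `policy`) and the random MDP are illustrative assumptions for this sketch, not anything taken from a particular paper:

```python
import numpy as np

def bellman_backup(Q, P, R, policy, gamma=0.99):
    """One application of T: (T Q)(s,a) = E[R(s,a)] + gamma * E_{S',A'}[Q(S',A')].

    P[s, a, s'] : transition probabilities, R[s, a] : expected reward,
    policy[s', a'] : probability of taking a' in s' (all assumed shapes).
    """
    # Expected value of the next state under the policy: V(s') = sum_a' pi(a'|s') Q(s', a')
    V_next = (policy * Q).sum(axis=1)       # shape: (num_states,)
    # Expectation over next states: sum_s' P(s'|s,a) V(s')
    expected_next = P @ V_next              # shape: (num_states, num_actions)
    return R + gamma * expected_next

# Random toy MDP; repeated application converges because T is a gamma-contraction.
num_states, num_actions = 4, 2
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(num_states), size=(num_states, num_actions))
R = rng.normal(size=(num_states, num_actions))
policy = np.full((num_states, num_actions), 1.0 / num_actions)

Q = np.zeros((num_states, num_actions))
for _ in range(1000):
    Q = bellman_backup(Q, P, R, policy)
```

Because \(\mathcal{T}\) is a \(\gamma\)-contraction, the loop converges to the same fixed point regardless of how \(Q\) is initialized.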
We start by defining the set of action-value distributions, whose elements map a state and an action to a probability distribution over the return:
\[\mathcal{Z} = \{ Z : \mathcal{S} \times \mathcal{A} \rightarrow \mathcal{P}(\mathbb{R}) \}\]
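As a hedged illustration of what an element \(Z \in \mathcal{Z}\) can look like in code (one possible representation, not the only parametrization), we can store, for every \((s, a)\) pair, a categorical distribution over a fixed grid of return values. The names `atoms`, `probs`, `v_min`, and `v_max` below are assumptions made for this sketch:

```python
import numpy as np

# Assumed fixed support ("atoms") for the return distribution.
num_states, num_actions = 4, 2
num_atoms, v_min, v_max = 51, -10.0, 10.0
atoms = np.linspace(v_min, v_max, num_atoms)

# probs[s, a] is a probability vector over the atoms, i.e. Z(s, a) is an
# element of P(R) supported on the grid.
probs = np.full((num_states, num_actions, num_atoms), 1.0 / num_atoms)

def expected_return(probs, atoms):
    """Recover the usual Q(s, a) = E[Z(s, a)] from the distributional representation."""
    return probs @ atoms                    # shape: (num_states, num_actions)

Q = expected_return(probs, atoms)
```

Taking the expectation of \(Z(s,a)\) recovers the ordinary action-value function, which is why the distributional view strictly generalizes the classical one.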