by Fedus, Gelada, Bengio, Bellemare and Larochelle (arXiv)
Geometric discounting is ubiquitous in reinforcement learning, e.g. in Q-learning:
\[Q(s_t, a_t) = r_t + \gamma \max_a Q(s_{t+1}, a)\]This work shows how geometric discounting arises from a belief in a single, constant hazard rate for the environment, and how different beliefs over the hazard rate lead to different discounting schemes (e.g. hyperbolic).
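For concreteness, here is a minimal sketch of that Q-learning update in Python (the state/action-space sizes, learning rate, and example transition are made-up illustrations, not taken from the paper):

```python
import numpy as np

n_states, n_actions = 10, 4      # made-up sizes for illustration
gamma, alpha = 0.99, 0.1         # geometric discount factor and learning rate
Q = np.zeros((n_states, n_actions))

def q_update(s, a, r, s_next):
    """One tabular Q-learning step toward the target r + gamma * max_a Q(s', a),
    applied as a learning-rate update rather than a hard assignment."""
    target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (target - Q[s, a])

# e.g. after observing the (hypothetical) transition (s=3, a=1, r=1.0, s'=4):
q_update(3, 1, 1.0, 4)
```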
This work was rejected from ICLR; one reviewer criticized it for adding little beyond prior work, specifically Zeb Kurth-Nelson and A. David Redish’s “Temporal-difference reinforcement learning with distributed representations” (2009) and Peter D. Sozou’s “On hyperbolic discounting and uncertain hazard rates” (1998). I haven’t had time to sort out who contributed what, but be aware that the paper’s authors may not deserve sole credit for the ideas below.
We first need two functions: the survival function, \(s(t)\), which gives the probability the agent is alive at time \(t\), and the hazard rate
\[h(t) = - \partial_t \log s(t) = -\frac{1}{s(t)} \partial_t s(t)\]The authors reason that one should discount a future reward by the probability of being alive to obtain it:
\[V(r_t) = s(t) r_t\]The authors then show that geometric discounting arises under a constant hazard rate. Specifically, suppose \(h(t) = \lambda > 0\) is constant. Then, solving for \(s(t)\)
\[\begin{align} \lambda &= - \partial_t \log s(t)\\ -\int_0^t \lambda \, dt' &= \log s(t) \quad \text{(using } s(0) = 1\text{)}\\ s(t) &= e^{-\lambda t}\\ &= (e^{-\lambda})^t\\ &= \gamma^t \end{align}\]where \(\gamma := e^{-\lambda} = e^{-h(t)}\). The intuition is that \(\gamma\) is the per-time-step probability of the episode continuing. For instance, suppose the hazard rate \(h(t) \rightarrow \infty\), i.e. the probability of surviving is essentially 0; then \(\gamma \rightarrow 0\), a very myopic view, which makes sense given that the agent does not expect to survive.
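To make the “per-time-step probability of continuing” reading concrete, here is a small simulation (the particular \(\lambda\), horizon, and episode count are arbitrary choices of mine): episodes that terminate at each step with probability \(1 - \gamma\) have an empirical survival curve matching \(\gamma^t = e^{-\lambda t}\).

```python
import numpy as np

rng = np.random.default_rng(0)
lam, horizon, n_episodes = 0.3, 10, 200_000   # arbitrary illustrative values
gamma = np.exp(-lam)                          # per-step continuation probability

# Each step, every still-alive episode survives with probability gamma.
alive = np.ones(n_episodes, dtype=bool)
survival_mc = []
for t in range(1, horizon + 1):
    alive &= rng.random(n_episodes) < gamma
    survival_mc.append(alive.mean())

print(np.round(survival_mc, 3))                          # empirical s(t)
print(np.round(gamma ** np.arange(1, horizon + 1), 3))   # gamma^t, should match
```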
If a constant hazard rate gives rise to geometric discounting, what discounting does an uncertain hazard rate give rise to? Suppose we have a prior \(p(\lambda)\) over the hazard rate, i.e.
\[s(t) = \int_{\lambda=0}^{\infty} p(\lambda) e^{-\lambda t} d\lambda\]If \(p(\lambda)\) is exponential with rate \(1/k\), then the survival function is hyperbolic, meaning rewards should be discounted hyperbolically.
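Before the derivation, a quick numerical sanity check of this claim (the value of \(k\) is arbitrary, and I assume scipy is available):

```python
import numpy as np
from scipy.integrate import quad

k = 2.0   # arbitrary prior parameter: p(lambda) = (1/k) * exp(-lambda / k)

def survival(t):
    """s(t) = integral over lambda of p(lambda) * exp(-lambda * t)."""
    integrand = lambda lam: (1.0 / k) * np.exp(-lam / k) * np.exp(-lam * t)
    return quad(integrand, 0.0, np.inf)[0]

for t in [0.0, 0.5, 1.0, 5.0]:
    print(t, survival(t), 1.0 / (1.0 + k * t))   # the two values agree
```

The derivation itself: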
\[\begin{align} s(t) &= \int_{\lambda=0}^{\infty} p(\lambda) e^{-\lambda t} d\lambda\\ &= \int_{\lambda=0}^{\infty} \frac{1}{k} e^{-\lambda/k} e^{-\lambda t} d\lambda\\ &= \frac{1}{k} \frac{-1}{1/k + t} e^{-\lambda (1/k + t)} \Big\lvert_{\lambda=0}^{\infty}\\ &= -\frac{1}{1 + kt} [0 - 1]\\ &= \frac{1}{1 + kt} \end{align}\]Other priors over the hazard rate give different discounting schemes. For instance, suppose \(p(\lambda) = \frac{1}{k}\) is uniform over \([0, k]\). Then
\[s(t) = \frac{1}{kt}(1 - e^{-kt})\]

tags: reinforcement-learning - discounting - hazard-rate - survival-function