Resume
Research
Learning
Blog
Talks
Teaching
Jokes
Kernel Papers

3 March 2021

# Hyperbolic Discounting and Learning over Multiple Horizons

by Fedus, Gelada, Bengio, Bellemare and Larochelle (Arxiv)

## Motivation

Geometric discounting is ubiquitous in reinforcement learning e.g. Q-learning:

$Q(s_t, a_t) = r_t + \gamma \max_a Q(s_{t+1}, a)$

This work shows how geometric discounting arises from a single belief as to the environment’s hazard rate, and how different beliefs over the hazard rate lead to different discounting schemes (e.g. hyperbolic).

## Correct Credit

This work was rejected from ICLR; one reviewer criticized the work for adding little beyond preceding work, specifically Zeb Kurth-Nelson and A David Redish’s “Temporal-difference reinforcement learning with distributed representations” (2009) and Peter D Sozou’s “On hyperbolic discounting and uncertain hazard rates” (1998). I don’t have time to review who contributed what, but know that these authors may not deserve credit for the following ideas.

## Geometric Discounting from Constant Hazard Rate

We first need two functions: the survival function, $$s(t)$$, which gives the probability the agent is alive at time $$t$$, and the hazard rate

$h(t) = - \partial_t \log s(t) = -\frac{1}{s(t)} \partial_t s(t)$

The authors reason that one should discount a future reward by the probability of being alive to obtain it:

$V(r_t) = s(t) r_t$

The authors then show that geometric discounting arises under a constant hazard rate. Specifically, suppose $$h(t) = \lambda > 0$$ is constant. Then, solving for $$s(t)$$

\begin{align} \lambda &= - \partial_t \log s(t)\\ -\int_0^T \lambda dt &= \log s(t)\\ s(t) &= e^{-\lambda t}\\ &= (e^{-\lambda})^t\\ &= \gamma^t \end{align}

where $$\gamma := e^{-\lambda} = e^{-h(t)}$$. The intuition is that $$\gamma$$ is the per-time-step probability of the episode continuing. For instance, suppose the hazard rate $$h(t) \rightarrow \infty$$ i.e. the probability of surviving is basically 0; then, $$\gamma \rightarrow 0$$, a very myopic view which makes sense given one’s expectation they won’t survive.

## Alternative Discounting from Non-Constant Hazard Rate

If a constant hazard rate gives rise to geometric discounting, what discounting does a non-constant hazard rate give rise to? Suppose we have a prior $$p(\lambda)$$ on the hazard rate i.e.

$s(t) = \int_{\lambda=0}^{\infty} p(\lambda) e^{-\lambda t} d\lambda$

If $$p(\lambda)$$ is exponential with rate $$1/k$$, then the survival function is hyperbolic, meaning rewards should be discounted hyperbolically.

\begin{align} s(t) &= \int_{\lambda=0}^{\infty} p(\lambda) e^{-\lambda t} d\lambda\\ &= \int_{\lambda=0}^{\infty} \frac{1}{k} e^{-\lambda/k} e^{-\lambda t} d\lambda\\ &= \frac{1}{k} \frac{-1}{1/k + t} e^{\lambda (1/k + t)} \Big\lvert_{\lambda=0}^{\infty}\\ &= -\frac{1}{1 + kt} [0 - 1]\\ &= \frac{1}{1 + kt} \end{align}

Other priors over the hazard rate give different discounting schemes. For instance, suppose $$p(\lambda) = \frac{1}{k}$$ i.e. uniform over $$[0, k]$$. Then

$s(t) = \frac{1}{kt}(1 - e^{-kt})$ tags: reinforcement-learning - discounting - hazard-rate - survival-function