An Algorithmic Theory of Metacognition in Minds and Machines
Abstract
We propose modifying an Actor-Critic so that the Actor and Critic interact several times within each environment step. The Actor constructs a new policy, samples a hypothetical action, queries the Critic, and repeats until satisfied or forced to act. This establishes a connection between Bayesian Optimization and Reinforcement Learning.
Summary
A simple modification to Actor-Critic that enables RL agents to detect and correct their own mistakes through metacognitive interaction.
Three Motivating Phenomena
Phenomenon 1: If you ask people to perform a task and then evaluate their own performance, roughly half the population is better at judging whether they answered correctly than they are at the task itself. How can this be possible?

Phenomenon 2: If you ask people to perform error-prone tasks, a response-locked error-related negativity (ERN) signal appears when they err. The brain is signaling that it made a mistake - but how does the brain know?

Phenomenon 3: Via lesions, pharmacology, TMS, or other interventions, you can dissociate people's task performance from their self-evaluation. This seems paradoxical: if I'm better (or worse) at knowing whether I erred, shouldn't I also be better (or worse) at the task?
The Idea
Modify an Actor-Critic so that the Actor and Critic interact several times within each environment step. The Actor constructs a new policy, samples a hypothetical action, queries the Critic, and repeats until satisfied or forced to act.
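To make the loop concrete, here is a minimal Python sketch of the modified action-selection step under some assumptions: a discrete action space, an actor(state, feedback) callable that returns action probabilities conditioned on the Critic's feedback so far, and a critic(state, action) callable that returns an estimated Q-value. The names max_queries and satisfaction_threshold are illustrative choices, not the paper's exact interface.

import numpy as np

def select_action(actor, critic, state, max_queries=5, satisfaction_threshold=0.0):
    """Inner Actor-Critic loop within one environment step: the Actor constructs a
    policy, samples a hypothetical action, queries the Critic, and repeats until
    satisfied or forced to act. Interfaces and hyperparameters are illustrative."""
    feedback = []                                    # (action, estimated value) from earlier queries
    best_action, best_value = None, -np.inf

    for _ in range(max_queries):
        policy = actor(state, feedback)              # construct a new policy given the Critic's feedback
        action = np.random.choice(len(policy), p=policy)  # sample a hypothetical action
        value = critic(state, action)                # query the Critic for an estimate of Q(state, action)
        feedback.append((action, value))

        if value > best_value:
            best_action, best_value = action, value

        if value >= satisfaction_threshold:          # satisfied: commit to this action
            return action

    return best_action                               # forced to act: best hypothetical action seen so far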

Connection to Bayesian Optimization
This establishes a connection between Bayesian Optimization and Reinforcement Learning. The Critic plays the role of the surrogate model and the Actor plays the role of the acquisition function for the black-box optimization problem of action selection: argmax_a Q(s_t, a).
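Under this reading, a single environment step looks like a short run of Bayesian Optimization over actions. A hedged sketch reusing the same illustrative actor/critic interfaces as above, where the Critic serves as the surrogate and the Actor proposes candidates like an acquisition function:

def bayes_opt_action_selection(actor, critic, state, num_rounds=5):
    """Action selection viewed as black-box optimization of a -> Q(state, a).
    The Critic acts as the surrogate (cheap value estimates); the Actor acts as
    the acquisition function (proposes the next candidate given past queries).
    Interfaces are illustrative assumptions, not the paper's exact API."""
    observations = []                                       # history of (action, estimated value)
    for _ in range(num_rounds):
        policy = actor(state, observations)                 # acquisition: propose a new policy
        candidate = max(range(len(policy)), key=lambda a: policy[a])  # most promising action
        value = critic(state, candidate)                    # surrogate: estimate Q(state, candidate)
        observations.append((candidate, value))
    # Return the incumbent: the queried action with the highest estimated value,
    # an approximation of argmax_a Q(s_t, a).
    return max(observations, key=lambda pair: pair[1])[0]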

Result: Error Detection and Correction
What do you get if you do this? You get a Reinforcement Learning agent that can (sometimes) detect and correct its own mistakes!

The intuition is simple: in the classic Actor-Critic, suppose the Actor tries to drive off a cliff and the Critic signals that this is a bad action. The agent still drives off the cliff, because the Critic's evaluation only shapes future learning updates, not the current choice. Here, we give the Actor a chance to take what the Critic knows into account before acting.
The agent detects its own erroneous actions even if the Actor’s policy samples those actions with high probability.
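One way to picture that detection step, again under the illustrative actor/critic interfaces sketched above (the error_margin threshold and the exhaustive query over actions are assumptions for this example, reasonable only for small discrete action spaces):

def detect_and_correct(actor, critic, state, error_margin=1.0):
    """Flag a sampled action the Critic rates clearly worse than an alternative,
    even if the Actor's policy assigned that action high probability, and act on
    the better alternative instead. Interfaces and threshold are illustrative."""
    policy = actor(state, [])                                    # initial policy, no feedback yet
    sampled = max(range(len(policy)), key=lambda a: policy[a])   # the high-probability action
    sampled_value = critic(state, sampled)

    # Query the Critic about the other hypothetical actions.
    values = {a: critic(state, a) for a in range(len(policy))}
    best = max(values, key=values.get)

    error_detected = values[best] - sampled_value > error_margin
    chosen = best if error_detected else sampled                 # correct using what the Critic knows
    return chosen, error_detected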

See the full research page for more details.
