$\DeclareMathOperator*{\argmax}{argmax}$

Explanation of:
Multi-agent Reinforcement Learning in Sequential Social Dilemmas

Leibo et al. 2017.

I've found that the overwhelming majority of online information on artificial intelligence research falls into one of two categories: the first is aimed at explaining advances to lay audiences, and the second is aimed at explaining advances to other researchers. I haven't found a good resource for people with a technical background who are unfamiliar with the more advanced concepts and are looking for someone to fill them in. This is my attempt to bridge that gap, by providing approachable yet (relatively) detailed explanations. In this post, I explain the titular paper - Multi-agent Reinforcement Learning in Sequential Social Dilemmas.


Motivation

Game theory, "the study of mathematical models of conflict and cooperation between intelligent rational decision-makers," is a fascinating field that evolved from the foundational work of John von Neumann. Of particular interest are social dilemmas: situations in which behavior that is rational for each individual leads to an outcome that is worse for the group as a whole. Social dilemmas show up throughout economics, political science, international relations, and biology.


At the heart of game theory is the idea of modeling agents' expected rewards using a matrix that assigns payoffs to specific actions (payoffs that usually depend on other agents' actions). The canonical Prisoner's dilemma is one such example. If you aren't familiar with this example, Wikipedia explains it far better than I could; what's important to understand is that in the Prisoner's dilemma, if each agent acts rationally, i.e., to improve his or her own outcome, the group will collectively receive a worse outcome than if each agent had instead chosen an individually suboptimal action (i.e., cooperated). Because we can construct matrices that capture these expected values, this type of situation is often referred to as a Matrix Game Social Dilemma (MGSD).
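To make the matrix formulation concrete, here is a minimal sketch of the Prisoner's dilemma as a payoff matrix. The specific values (R=3, S=0, T=5, P=1) are the conventional ones; any payoffs satisfying T > R > P > S produce the same dilemma. The `best_response` helper is my own illustrative name, not anything from the paper.

```python
import numpy as np

# Payoff matrix from the row player's perspective.
# Rows/columns: 0 = cooperate, 1 = defect.
# R = mutual cooperation, S = sucker's payoff, T = temptation, P = mutual defection.
R, S, T, P = 3, 0, 5, 1
payoff_row = np.array([[R, S],
                       [T, P]])

def best_response(opponent_action):
    """Row player's individually rational reply to a fixed opponent action."""
    return int(np.argmax(payoff_row[:, opponent_action]))

# Defection is a dominant strategy: it is the best response no matter
# what the opponent does...
assert best_response(0) == 1  # opponent cooperates -> defect
assert best_response(1) == 1  # opponent defects    -> defect

# ...yet mutual defection pays each player less than mutual cooperation.
print(payoff_row[1, 1], payoff_row[0, 0])  # 1 3
```

Both players following their dominant strategy land on the (P, P) outcome, even though (R, R) would be better for both; that gap is the dilemma.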


Leibo et al.'s paper argues that MGSDs fail to capture many essential aspects of real-world social dilemmas. They outline five reasons:

  1. "Real world social dilemmas are temporally extended." (Personally, I don't think this is a valid criticism of MGSDs.)
  2. Cooperation and defection are labels that apply to policies implementing strategic decisions.
  3. Cooperativeness may be a graded quantity.
  4. Decisions to cooperate or defect occur only quasi-simultaneously, since some information about what player 2 is starting to do can inform player 1's decision and vice versa.
  5. Decisions must be made despite only having partial information about the state of the world and the activities of the other players.

Background

Intuition

Mathematics

Markov Games

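As a rough sketch of the formalism the paper builds on (this is my notation, following the standard Littman-style Markov game setup rather than anything specific to this post): an $N$-player Markov game is a tuple

$$\mathcal{M} = \langle \mathcal{S},\, \mathcal{O},\, \mathcal{A}_1, \ldots, \mathcal{A}_N,\, \mathcal{T},\, r_1, \ldots, r_N \rangle,$$

where $\mathcal{S}$ is the set of states, each agent $i$ receives a (possibly partial) observation $o_i = \mathcal{O}(s, i)$ and selects an action from its own set $\mathcal{A}_i$, the transition function $\mathcal{T}(s' \mid s, a_1, \ldots, a_N)$ depends on the joint action, and each agent has its own reward function $r_i(s, a_1, \ldots, a_N)$. Each agent then learns a policy $\pi_i(a_i \mid o_i)$ that maximizes its own expected discounted return. This generalizes both Markov decision processes (the $N = 1$ case) and matrix games (the single-state case).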


Sequential Social Dilemma


Deep Multiagent Reinforcement Learning
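The paper trains each agent with its own independent deep Q-network, treating the other agents simply as part of the environment. As a refresher (this is standard Q-learning material, not specific to the paper), the tabular update rule is

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right],$$

with the greedy policy acting as $\argmax_a Q(s, a)$. Deep Q-learning replaces the table with a neural network $Q(s, a; \theta)$, trained to minimize the squared temporal-difference error $\left( r + \gamma \max_{a'} Q(s', a'; \theta^-) - Q(s, a; \theta) \right)^2$, where $\theta^-$ denotes a periodically updated target network.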


Experiments and Results

Gathering

Wolfpack

Discussion

Summary

Notes

I appreciate any and all feedback. If I've made an error or if you have a suggestion, you can email me or comment on the Reddit or HackerNews threads.