This content is associated with a 2014 GDC AI Summit lecture on algorithms for
dynamic behavior, including Regret Matching, UCB1 and MCTS. The purpose of this page
is to provide a few more technical details beyond what is provided in the talk,
to provide links to more reading, and to provide demos that allow you to watch how
the algorithms perform in rock-paper-scissors. (These demos are available from the
tabs above.)
You can download the
full slides from the talk. (27MB)
Regret Matching
Regret matching can refer to several similar algorithms that build strategies
around the concept of regret. Informally, the regret of an action is how much
you wish you played that action when you did something else instead. Two algorithms
that play against each other using Regret Matching will have their average strategies
converge to a
Nash equilibrium.
Importantly, regret matching produces a mixed strategy
that randomizes between the available actions.
Even though a Nash equilibrium won't lose more than it should, it won't always
take advantage of suboptimal opponent behavior. Using regret matching over strategies
can improve this behavior. (Try the demo links above.)
A simple version of Regret matching works as follows:
- Let R(i, t) be the regret of action i at time t.
- Let U(a, t) be the utility of taking action a at time t. (This may not be known until time t+1 when we see the opponent action.)
- Let R+(i, t) be R(i, t) when R(i, t) is positive, and 0 otherwise.
- Let sum(t) = ∑_i R+(i, t), the sum of positive regrets over all actions.
- If sum(t) > 0, the probability of taking action i at time t+1 is R+(i, t)/sum(t).
- If sum(t) = 0, play all actions with equal probability.
- After taking an action, a, update the regret of every action i using:
R(i, t+1) = R(i, t) + U(i, t) - U(a, t)
This adds to the regret what we would have gained by changing to action i from action a.
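To make the update concrete, here is a minimal Python sketch of this simple version for
rock-paper-scissors. The payoff table and the function names (get_strategy, choose_action,
update_regret) are illustrative choices, not something prescribed by the algorithm.

import random

ACTIONS = ["rock", "paper", "scissors"]

# Utility of playing the first action against the second: win = 1, loss = -1, tie = 0.
WINS = {("rock", "scissors"), ("scissors", "paper"), ("paper", "rock")}

def utility(action, opponent_action):
    if action == opponent_action:
        return 0
    return 1 if (action, opponent_action) in WINS else -1

# R(i, t) for each action, initialized to 0.
regret = {a: 0.0 for a in ACTIONS}

def get_strategy():
    # Each action is played in proportion to its positive regret R+(i, t).
    positive = {a: max(r, 0.0) for a, r in regret.items()}
    total = sum(positive.values())
    if total > 0:
        return {a: positive[a] / total for a in ACTIONS}
    return {a: 1.0 / len(ACTIONS) for a in ACTIONS}  # no positive regret: play uniformly

def choose_action():
    strategy = get_strategy()
    return random.choices(ACTIONS, weights=[strategy[a] for a in ACTIONS])[0]

def update_regret(played, opponent_action):
    # R(i, t+1) = R(i, t) + U(i, t) - U(a, t), where a is the action we played.
    played_utility = utility(played, opponent_action)
    for a in ACTIONS:
        regret[a] += utility(a, opponent_action) - played_utility

Each round, choose_action samples from the current mixed strategy, and once the opponent's
action is revealed you call update_regret. In self-play it is the average of the strategies
returned by get_strategy over time, not the current strategy, that converges toward the
equilibrium mix.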
This is the simplest version of regret matching. There are a few things that should
probably be done if this is being deployed in a game.
- You can start with initial regrets of 0, but you can also initialize regrets to
induce interesting initial behavior. (If we used 10 for all actions in rock-paper-scissors,
you would start by playing randomly, and it would take a while to deviate
from that strategy.)
- You can limit the minimum regret to avoid the algorithm "overlearning" that a
particular action is bad. (Try this in the demos -- over the long term this doesn't
make regret matching exploitable, but it will over the short term.) A sketch of both
of these tweaks appears after this list.
- Regret matching requires the ability to introspectively ask what the utility of
an alternate action would have been, something not always possible.
- Regret matching works well when all actions are reasonable. It will then properly
balance the play of the actions to maximize utility.
- Utility should be designed to create an interesting experience for the player.
For instance, you might lie to the AI about damage so that it consistently
overestimates its power and underestimates the power of the opponent.
- Regret matching works well in situations where no single action dominates.
For example, look at David Sirlin's book on
Playing to Win. He describes several situations where regret matching
would work well.
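As a sketch of the first two points above, continuing the rock-paper-scissors example:
non-zero initial regrets and a regret floor are each a one-line change. The constants here
are arbitrary illustrations.

INITIAL_REGRET = 10.0   # positive starting regret keeps early play close to uniform
MIN_REGRET = -20.0      # floor keeps the algorithm from "overlearning" that an action is bad

regret = {a: INITIAL_REGRET for a in ACTIONS}

def update_regret_clamped(played, opponent_action):
    played_utility = utility(played, opponent_action)
    for a in ACTIONS:
        regret[a] = max(MIN_REGRET, regret[a] + utility(a, opponent_action) - played_utility)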
Some other notes about regret matching:
- Regret matching is one of the key ideas used in poker to solve large games offline.
- The version of regret matching implemented here is based on
the regret for the utility of the *action* we played. For
our examples here this is sufficient, but in general you
will want to use regret over the *action distribution*. That
is, use the expected utility of all actions that could have
been played at the previous time step, instead of just the
utility of the action that was played.
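Continuing the same sketch, the distribution-based update might look as follows; the
baseline is the expected utility of the strategy we sampled from, rather than the utility
of the single action we happened to play.

def update_regret_from_strategy(strategy, opponent_action):
    # Expected utility of the mixed strategy we were playing at the previous step.
    expected_utility = sum(strategy[a] * utility(a, opponent_action) for a in ACTIONS)
    for a in ACTIONS:
        regret[a] += utility(a, opponent_action) - expected_utility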
Regret Matching References
- Hart and Mas-Colell
paper which provides the theoretical basis of the regret matching algorithm.
- Michael Johanson's
MSc thesis on CFR for poker. CFR uses Regret Matching on trees to solve poker games,
and the thesis is more accessible than the previous paper.
- A reference on regret-based
algorithms.
- A survey
of regret and bandit algorithms. (This describes exp3, an algorithm that has
good theoretical guarantees, but is much more complex than Regret Matching.)
- A page about Nash equilibrium for games
by Ian Schreiber.
UCB1
UCB1 is a bandit algorithm that can be used to estimate the payoff (utility) associated
with a set of actions. Bandit algorithms are algorithms that work on an n-armed
bandit (slot machine), where you have some time window in which to pull the arms of the
machine. Each arm has a hidden payoff structure, and the goal is to maximize the total
payoff by the end of the play session.
While theoretically speaking UCB1 is a history-based algorithm (it determines its
next action based on the experienced history of play), it can also be used with
simulations of future actions. (
A demo of this situation is not yet available here, but examples of this are available in the GDC talk slides.)
UCB1 works as follows:
- UCB1 balances exploration and exploitation using confidence bounds. (UCB1 stands
for Upper Confidence Bound, which points to the theoretical foundation of the algorithm.)
- Let x(i) be the average utility of action i.
- Let c(i) be the count of how many times action i has been tried.
- Let t be the total number of times we've tried actions.
- UCB1 selects as the next action the one that maximizes x(i) + sqrt(2 * log(t) / c(i))
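As a concrete illustration, here is a minimal Python sketch of that selection rule; the
bookkeeping (a dictionary of averages and counts) is an implementation choice, not part of
the algorithm's definition.

import math

def ucb1_select(average_utility, counts, total_plays):
    # average_utility[i] is x(i), counts[i] is c(i), total_plays is t.
    # Any action that has never been tried is selected first.
    for action, count in counts.items():
        if count == 0:
            return action
    return max(counts, key=lambda a: average_utility[a] +
               math.sqrt(2.0 * math.log(total_plays) / counts[a]))

After playing the selected action and observing its utility, you update x(i) and c(i) for
that action and then select again.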
As UCB1 plays deterministically, its play is completely predictable. Thus, UCB1 should
not be used in situations where it needs to mix together actions to play appropriately.
When playing Rock-Paper-Scissors, for instance, UCB1 will not play well when given
basic actions of Rock, Paper & Scissors. But, if it is given high-level strategies
as its actions, it will play much better. (Try these in the demos.)
There are some theoretical caveats to be aware of when using UCB1, but experience
has shown that they don't matter as much as you might think. For completeness
we point them out here, although for most uses they won't matter:
- UCB1 is not designed for adversarial play, but exp3 (see references above) is.
Thus, if we want to use it for adversarial play we want to (1) use more complex
strategies instead of low-level actions, and (2) make sure every available strategy
is "reasonable", in that we won't look stupid for playing any of them.
- UCB1, when used as part of MCTS, is not directly trying to maximize the chance of making
the right move. This has been fairly widely acknowledged, but in practice the approach is
simple and does a good job.
UCB1 is used very widely as the basis of predictive algorithms such as
Monte Carlo Tree Search (MCTS). In this situation, instead of playing an action and
getting a utility back, you simulate an action and the resulting play to estimate
the utility of the action. After doing this sufficiently many times, you will have
a good estimate of the quality of each action, and then can choose to use that action
in practice. This solves the question of how to avoid making bad actions when we can't
afford to let the user see our bad actions - we only make bad actions in simulation.
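A rough sketch of that loop, reusing the ucb1_select function above and assuming a
simulate(action) function that plays out the game from that action and returns an estimated
utility (this is a flat, one-level use of UCB1 rather than full MCTS):

def choose_by_simulation(actions, simulate, budget=1000):
    average = {a: 0.0 for a in actions}
    counts = {a: 0 for a in actions}
    for t in range(1, budget + 1):
        action = ucb1_select(average, counts, t)
        value = simulate(action)  # bad actions are only ever tried in simulation
        counts[action] += 1
        average[action] += (value - average[action]) / counts[action]
    # Commit to the most-visited action for actual play.
    return max(actions, key=lambda a: counts[a])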
UCB1 references
- The
original UCB1 paper which presents several different algorithms for different
scenarios.
- A
revised UCB1 paper which modifies the algorithm to improve regret bounds.
- mcts.ai, a web page with
information and publications about Monte-Carlo tree search.