This content is associated with a 2014 GDC AI Summit lecture on algorithms for
dynamic behavior, including Regret Matching, UCB1 and MCTS. The purpose of this page
is to provide a few more technical details beyond what is provided in the talk,
to provide links to more reading, and to provide demos that allow you to watch how
the algorithms perform in rock-paper-scissors. (These demos are available from the
tabs above.)
You can download the
full slides from the talk. (27MB)
Regret Matching
Regret matching can refer to several similar algorithms that build strategies
around the concept of regret. Informally, the regret of an action is how much
you wish you played that action when you did something else instead. Two algorithms
that play against each other using Regret Matching will have their average strategies
converge to a
Nash equilibrium.
Importantly, regret matching produces a mixed strategy
that randomizes between the available actions.
Even though a Nash equilibrium won't lose more than it should, it won't always
take advantage of suboptimal opponent behavior. Using regret matching over strategies
can improve this behavior. (Try the demo links above.)
A simple version of Regret matching works as follows:
- Let R(i, t) be the regret of action i at time t.
- Let U(a, t) be the utility of taking action a at time t. (This may not be known until time t+1 when we see the opponent action.)
- Let R+(i, t) be R(i, t) when R(i, t) is positive, and 0 otherwise.
- Let sum(t) = ∑_i R+(i, t), the sum of positive regrets over all actions.
- If sum(t) > 0, the probability of taking action i at time t+1 is R+(i, t)/sum(t).
- If sum(t) = 0, play all actions with equal probability.
- After taking an action, a, update the regret of every action i using:
R(i, t+1) = R(i, t) + U(i, t) - U(a, t)
This adds to the regret what we would have gained by changing to action i from action a.
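To make the update concrete, here is a minimal Python sketch of this simple version for
rock-paper-scissors. The payoff table and the function names (get_strategy, choose_action,
update_regret) are illustrative choices, not something prescribed by the algorithm.

import random

ACTIONS = ["rock", "paper", "scissors"]

# Utility of playing the first action against the second: win = 1, loss = -1, tie = 0.
WINS = {("rock", "scissors"), ("scissors", "paper"), ("paper", "rock")}

def utility(action, opponent_action):
    if action == opponent_action:
        return 0
    return 1 if (action, opponent_action) in WINS else -1

# R(i, t) for each action, initialized to 0.
regret = {a: 0.0 for a in ACTIONS}

def get_strategy():
    # Each action is played in proportion to its positive regret R+(i, t).
    positive = {a: max(r, 0.0) for a, r in regret.items()}
    total = sum(positive.values())
    if total > 0:
        return {a: positive[a] / total for a in ACTIONS}
    return {a: 1.0 / len(ACTIONS) for a in ACTIONS}  # no positive regret: play uniformly

def choose_action():
    strategy = get_strategy()
    return random.choices(ACTIONS, weights=[strategy[a] for a in ACTIONS])[0]

def update_regret(played, opponent_action):
    # R(i, t+1) = R(i, t) + U(i, t) - U(a, t), where a is the action we played.
    played_utility = utility(played, opponent_action)
    for a in ACTIONS:
        regret[a] += utility(a, opponent_action) - played_utility

Each round, choose_action samples from the current mixed strategy, and once the opponent's
action is revealed you call update_regret. In self-play it is the average of the strategies
returned by get_strategy over time, not the current strategy, that converges toward the
equilibrium mix.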
This is the simplest version of regret matching. There are a few things that should
probably be done if this is being deployed in a game.
- You can start with initial regrets of 0, but you can also initialize regrets to
induce interesting initial behavior. (If we used 10 for all actions in rock-paper-scissors,
you would start by playing randomly, and it would take a while to deviate
from that strategy.)
- You can limit the minimum regret to avoid the algorithm "overlearning" that a
particular action is bad. (Try this in the demos -- over the long term this doesn't
make regret matching exploitable, but it will over the short term.) A sketch of both
of these tweaks appears after this list.
- Regret matching requires the ability to introspectively ask what the utility of
an alternate action would have been, something not always possible.
- Regret matching works well when all actions are reasonable. It will then properly
balance the play of the actions to maximize utility.
- Utility should be designed to create an interesting experience for the player.
For instance, you might lie to the AI about damage so that it consistently
overestimates its power and underestimates the power of the opponent.
- Regret matching works well in situations where no single action dominates.
For example, look at David Sirlin's book on
Playing to Win. He describes several situations where regret matching
would work well.
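As a sketch of the first two points above, continuing the rock-paper-scissors example:
non-zero initial regrets and a regret floor are each a one-line change. The constants here
are arbitrary illustrations.

INITIAL_REGRET = 10.0   # positive starting regret keeps early play close to uniform
MIN_REGRET = -20.0      # floor keeps the algorithm from "overlearning" that an action is bad

regret = {a: INITIAL_REGRET for a in ACTIONS}

def update_regret_clamped(played, opponent_action):
    played_utility = utility(played, opponent_action)
    for a in ACTIONS:
        regret[a] = max(MIN_REGRET, regret[a] + utility(a, opponent_action) - played_utility)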
Some other notes about regret matching:
- Regret matching is one of the key ideas used in poker to solve large games offline.
- The version of regret matching implemented here is based on
the regret for the utility of the *action* we played. For
our examples here this is sufficient, but in general you
will want to use regret over the *action distribution*. That
is, use the expected utility of all actions that could have
been played at the previous time step, instead of just the
utility of the action that was played.
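Continuing the same sketch, the distribution-based update might look as follows; the
baseline is the expected utility of the strategy we sampled from, rather than the utility
of the single action we happened to play.

def update_regret_from_strategy(strategy, opponent_action):
    # Expected utility of the mixed strategy we were playing at the previous step.
    expected_utility = sum(strategy[a] * utility(a, opponent_action) for a in ACTIONS)
    for a in ACTIONS:
        regret[a] += utility(a, opponent_action) - expected_utility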
Regret Matching References
- Hart and Mas-Colell
paper which provides the theoretical basis of the regret matching algorithm.
- Michael Johanson's
MSc thesis on CFR for poker. CFR uses Regret Matching on trees to solve poker games,
and the thesis is more accessible than the previous paper.
- A reference on regret-based
algorithms.
- A survey
of regret and bandit algorithms. (This describes exp3, an algorithm that has
good theoretical guarantees, but is much more complex than Regret Matching.)
- A page about Nash equilibrium for games
by Ian Schreiber.
UCB1
UCB1 is a bandit algorithm that can be used to estimate the payoff (utility) associated
with a set of actions. Bandit algorithms are algorithms that work on an n-armed
bandit (slot machine), where you have some time window in which to pull the arms of the
machine. Each arm has a hidden payoff structure, and the goal is to maximize the total
payoff by the end of the play session.
While theoretically speaking UCB1 is a history-based algorithm (it determines its
next action based on the experienced history of play), it can also be used with
simulations of future actions. (
A demo of this situation is not yet available here, but examples of this are available in the GDC talk slides.)
UCB1 works as follows:
- UCB1 balances exploration and exploitation using confidence bounds. (UCB1 stands
for Upper Confidence Bound, which points to the theoretical foundation of the algorithm.)
- Let x(i) be the average utility of action i.
- Let c(i) be the count of how many times action i has been tried.
- Let t be the total number of times we've tried actions.
- UCB1 selects as the next action the one that maximizes x(i) + sqrt(2 * log(t) / c(i))
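As a concrete illustration, here is a minimal Python sketch of that selection rule; the
bookkeeping (a dictionary of averages and counts) is an implementation choice, not part of
the algorithm's definition.

import math

def ucb1_select(average_utility, counts, total_plays):
    # average_utility[i] is x(i), counts[i] is c(i), total_plays is t.
    # Any action that has never been tried is selected first.
    for action, count in counts.items():
        if count == 0:
            return action
    return max(counts, key=lambda a: average_utility[a] +
               math.sqrt(2.0 * math.log(total_plays) / counts[a]))

After playing the selected action and observing its utility, you update x(i) and c(i) for
that action and then select again.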
As UCB1 plays deterministically, its play is completely predictable. Thus, UCB1 should
not be used in situations where it needs to mix together actions to play appropriately.
When playing Rock-Paper-Scissors, for instance, UCB1 will not play well when given
basic actions of Rock, Paper & Scissors. But, if it is given high-level strategies
as its actions, it will play much better. (Try these in the demos.)
There are some theoretical caveats to be aware of when using UCB1, but experience
has shown that they don't matter as much as you might think. For completeness
we point them out here, although for most uses they won't matter:
- UCB1 is not designed for adversarial play, but exp3 (see references above) is.
Thus, if we want to use it for adversarial play we want to (1) use more complex
strategies instead of low-level actions, and (2) make sure every available strategy
is "reasonable", in that we won't look stupid for playing any of them.
- UCB1, when used as part of MCTS, is not directly trying to maximize the chance of making
the right move. This has been fairly widely acknowledged, but in practice the approach is
simple and does a good job.
UCB1 is used very widely as the basis of predictive algorithms such as
Monte Carlo Tree Search (MCTS). In this situation, instead of playing an action and
getting a utility back, you simulate an action and the resulting play to estimate
the utility of the action. After doing this sufficiently many times, you will have
a good estimate of the quality of each action, and then can choose to use that action
in practice. This solves the question of how to avoid making bad actions when we can't
afford to let the user see our bad actions - we only make bad actions in simulation.
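A rough sketch of that loop, reusing the ucb1_select function above and assuming a
simulate(action) function that plays out the game from that action and returns an estimated
utility (this is a flat, one-level use of UCB1 rather than full MCTS):

def choose_by_simulation(actions, simulate, budget=1000):
    average = {a: 0.0 for a in actions}
    counts = {a: 0 for a in actions}
    for t in range(1, budget + 1):
        action = ucb1_select(average, counts, t)
        value = simulate(action)  # bad actions are only ever tried in simulation
        counts[action] += 1
        average[action] += (value - average[action]) / counts[action]
    # Commit to the most-visited action for actual play.
    return max(actions, key=lambda a: counts[a])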
UCB1 references
- The
original UCB1 paper which presents several different algorithms for different
scenarios.
- A
revised UCB1 paper which modifies the algorithm to improve regret bounds.
- mcts.ai, a web page with
information and publications about Monte-Carlo tree search.