Online Algorithms for Video Game AI
This content is associated with a 2014 GDC AI Summit lecture on algorithms for dynamic behavior, including Regret Matching, UCB1, and MCTS. The purpose of this page is to give a few more technical details than the talk does, to link to further reading, and to offer demos that let you watch how the algorithms perform in rock-paper-scissors. (These demos are available from the tabs above.)

You can download the full slides from the talk. (27MB)

Regret Matching

Regret matching can refer to several similar algorithms that build strategies around the concept of regret. Informally, the regret of an action is how much you wish you had played that action when you played something else instead. For example, in rock-paper-scissors, if you play Rock and your opponent plays Paper, you earn -1; playing Scissors would have earned +1, so Scissors accrues a regret of 2 (and Paper, which would have tied, a regret of 1). In a two-player zero-sum game, two players that both use Regret Matching will have their average strategies converge to a Nash equilibrium. Importantly, regret matching produces a mixed strategy that randomizes between the available actions. Even though a Nash equilibrium won't lose more than it should, it won't always take advantage of suboptimal opponent behavior. Using regret matching over high-level strategies rather than basic actions can improve this behavior. (Try the demo links above.)

A simple version of regret matching works as follows: keep a cumulative regret total for each action. On each round, play each action with probability proportional to its positive cumulative regret (or play uniformly at random if no action has positive regret). After observing the outcome, add to each action's total the difference between what that action would have earned and what you actually earned.
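Below is a minimal Python sketch of that loop using rock-paper-scissors; the payoff table and the always-Rock opponent in the usage code are illustrative assumptions, not details from the talk.

    import random

    ACTIONS = ["rock", "paper", "scissors"]

    # Payoff to us: win = 1, tie = 0, loss = -1.
    PAYOFF = {
        "rock":     {"rock": 0, "paper": -1, "scissors": 1},
        "paper":    {"rock": 1, "paper": 0, "scissors": -1},
        "scissors": {"rock": -1, "paper": 1, "scissors": 0},
    }

    regret = {a: 0.0 for a in ACTIONS}  # cumulative regret per action

    def current_strategy():
        # Probabilities proportional to positive regret; uniform if none is positive.
        positive = {a: max(r, 0.0) for a, r in regret.items()}
        total = sum(positive.values())
        if total <= 0:
            return {a: 1.0 / len(ACTIONS) for a in ACTIONS}
        return {a: p / total for a, p in positive.items()}

    def play_round(opponent_action):
        strategy = current_strategy()
        my_action = random.choices(ACTIONS, weights=[strategy[a] for a in ACTIONS])[0]
        earned = PAYOFF[my_action][opponent_action]
        # Update: how much better would each alternative have done?
        for a in ACTIONS:
            regret[a] += PAYOFF[a][opponent_action] - earned
        return my_action

    # Usage: against an opponent that always plays rock, paper's regret grows
    # and the strategy shifts toward playing paper.
    for _ in range(1000):
        play_round("rock")
    print(current_strategy())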

This is the simplest version of regret matching. There are a few things that should probably be done if this is being deployed in a game.

Some other notes about regret matching:


UCB1

UCB1 is a bandit algorithm that can be used to estimate the payoff (utility) associated with a set of actions. Bandit algorithms are algorithms that work on an n-armed bandit (slot machine): you have some time window in which to pull the machine's arms, each arm has a hidden payoff structure, and the goal is to maximize your total payoff by the end of the play session.
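As a concrete illustration, here is a tiny Python sketch of that setup; the win/lose (Bernoulli) payoffs are an assumption made for the example.

    import random

    class Bandit:
        # An n-armed bandit: each arm pays 1 with some hidden probability, else 0.
        def __init__(self, payout_probs):
            self._payout_probs = payout_probs  # hidden from the player

        def pull(self, arm):
            return 1.0 if random.random() < self._payout_probs[arm] else 0.0

    # The player only observes rewards, never the probabilities; the goal is to
    # earn as much as possible within a fixed number of pulls.
    bandit = Bandit([0.2, 0.5, 0.7])
    reward = bandit.pull(0)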

While UCB1 is, theoretically speaking, a history-based algorithm (it determines its next action based on the experienced history of play), it can also be used with simulations of future actions. (A demo of this situation is not yet available here, but examples are available in the GDC talk slides.)

UCB1 works as follows: first play each action once. After that, always play the action that maximizes the average reward observed for that action plus an exploration bonus of sqrt(2 ln n / n_i), where n is the total number of plays so far and n_i is the number of times that action has been played. Record the reward you receive and update the averages.
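Here is a minimal Python sketch of that rule, assuming rewards have already been scaled into the range [0, 1]:

    import math

    class UCB1:
        def __init__(self, num_actions):
            self.counts = [0] * num_actions     # times each action was played
            self.totals = [0.0] * num_actions   # summed rewards per action
            self.plays = 0                      # total plays so far

        def select(self):
            # Play any untried action first.
            for a, c in enumerate(self.counts):
                if c == 0:
                    return a
            # Otherwise maximize average reward plus the exploration bonus.
            return max(
                range(len(self.counts)),
                key=lambda a: self.totals[a] / self.counts[a]
                              + math.sqrt(2.0 * math.log(self.plays) / self.counts[a]),
            )

        def update(self, action, reward):
            self.counts[action] += 1
            self.totals[action] += reward
            self.plays += 1

    # Usage: select an action, observe a reward in [0, 1] from the game, update.
    ucb = UCB1(3)
    a = ucb.select()
    ucb.update(a, reward=1.0)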

As UCB1 plays deterministically, its play is completely predictable. Thus, UCB1 should not be used in situations where it needs to mix its actions to play appropriately. In Rock-Paper-Scissors, for instance, UCB1 will not play well when given the basic actions Rock, Paper, and Scissors. But if it is given high-level strategies as its actions, it will play much better. (Try these in the demos.)

There are some theoretical caveats to be aware of when using UCB1, but experience has shown that they don't matter as much as you might think. For completeness we point them out here, although for most uses they won't matter:

UCB1 is widely used as the basis of predictive algorithms such as Monte Carlo Tree Search (MCTS). In this situation, instead of playing an action and getting a utility back, you simulate an action and the play that follows it in order to estimate the action's utility. After doing this sufficiently many times, you will have a good estimate of the quality of each action and can then choose the best one to play in practice. This solves the problem of how to avoid bad actions when we can't afford to let the player see them: we only take bad actions in simulation.
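A sketch of that pattern is below; simulate_game is a hypothetical stand-in for whatever forward model or playout your game can afford, and the budget of 1000 simulations is likewise an assumption for the example.

    import math, random

    def ucb1_plan(num_actions, simulate_game, budget=1000):
        # Spend the whole budget in simulation, then commit to the best action.
        counts = [0] * num_actions
        totals = [0.0] * num_actions
        for t in range(budget):
            untried = [a for a in range(num_actions) if counts[a] == 0]
            if untried:
                action = untried[0]
            else:
                action = max(
                    range(num_actions),
                    key=lambda a: totals[a] / counts[a]
                                  + math.sqrt(2.0 * math.log(t) / counts[a]),
                )
            reward = simulate_game(action)  # estimated utility in [0, 1]
            counts[action] += 1
            totals[action] += reward
        # Only now do we act for real; any bad actions happened in simulation.
        return max(range(num_actions), key=lambda a: totals[a] / max(counts[a], 1))

    # Hypothetical usage with a noisy stand-in simulator:
    best = ucb1_plan(3, simulate_game=lambda a: random.random() * (a + 1) / 3.0)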
