The primary reason for including a history of states is most likely indeed the ko rule. Even if a long history is often redundant, it's unlikely to hurt either (except that it costs some computation time... but that does not appear to be a major concern in either of the papers).
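For concreteness, here is a minimal sketch (my own illustration in Python/NumPy, not anything from the papers) of why a history matters for ko: a positional-superko check is just a lookup against previously seen positions, which the current position alone cannot provide.

```python
import numpy as np

def violates_superko(candidate_position, position_history):
    """Hypothetical helper: True if the candidate board exactly repeats
    any position already seen earlier in the game (positional superko)."""
    return any(np.array_equal(candidate_position, past)
               for past in position_history)

# Toy example on a 19x19 board: re-creating an earlier position is illegal.
empty = np.zeros((19, 19), dtype=np.int8)
history = [empty.copy()]
print(violates_superko(empty, history))  # True: this position occurred before
```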
I imagine that having a history of states can also help the Neural Network to more easily "focus" on important areas of the board during training (maybe only at the beginning of the training process, maybe for longer). You're right that, from a game-theoretic point of view, the current game state should be sufficient (ignoring rare cases). **However, the learned components of AlphaGo Zero are not optimal in a game-theoretic sense early on in training** (and likely still aren't after training either). Architectural choices in the Neural Network that are redundant in a game-theoretic sense may still be beneficial for more rapid learning.
For example, in early stages of learning, when the learned components still perform poorly, I imagine a useful heuristic may be to pay more "attention" to areas of the board where moves have recently been made; these are more likely to be important than quiet areas where nothing is going on (especially for an amateur-level component at the beginning of the learning process). Such areas can easily be identified by subtracting consecutive game states in the recent history from each other, as in the sketch below.
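As a rough illustration of that last point (again my own sketch, not taken from the papers), differencing two consecutive history planes immediately exposes where the latest moves were made:

```python
import numpy as np

# Hypothetical example: history[t] is a 19x19 plane of one player's stones
# after move t (1 = stone present, 0 = empty), as in a stacked-history input.
history = [np.zeros((19, 19), dtype=np.int8) for _ in range(3)]
history[1][3, 3] = 1          # a stone appears at (3, 3) on move 1
history[2][3, 3] = 1
history[2][15, 16] = 1        # another stone appears at (15, 16) on move 2

# Subtracting consecutive planes leaves 1s only where something changed,
# i.e. where moves (or captures) just happened -- a crude "recent activity" map.
recent_activity = np.abs(history[-1] - history[-2])
print(np.argwhere(recent_activity))   # -> [[15 16]]
```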
I would not be surprised if this indeed ends up happening to some extent during the learning process. Intuitively, Stochastic Gradient Descent tends to take the path of "least resistance", greedily optimizing parameters wherever it happens to notice some correlation between inputs and outputs. Of course, this whole argument is rather handwavy, and it may not be happening at all... I suppose the main point is what I put in boldface above: Neural Network-based components are not optimal in a game-theoretic sense, and they're also trained using algorithms (SGD-style) that may take suboptimal paths to global minima, or may not find a global minimum at all. Modifications that would not be necessary for an "optimal" solution may still help to find good solutions more easily or more quickly.
David Silver, one of the first authors on both of the AlphaGo papers, describes a hypothesis similar to what I called "focus" above (more commonly described as "attention") here (emphasis mine):
> Actually, the representation would probably work well with other choices than 8 planes! But we use a stacked history of observations for three reasons: 1. it is consistent with common input representations in other domains (e.g. Atari), 2. we need some history to represent ko, 3. **it is useful to have some history to have an idea of where the opponent played recently - these can act as a kind of attention mechanism** (i.e. focus on where my opponent thinks is important).
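For reference, here is a small sketch of the kind of stacked-history input Silver is referring to. The 17-plane layout (8 planes of the current player's stones, 8 of the opponent's, plus one colour-to-play plane) follows the AlphaGo Zero paper; the helper function and its names below are just my own illustration.

```python
import numpy as np

def build_input_stack(own_history, opp_history, black_to_play):
    """Stack the last 8 board planes for each player plus a colour plane.

    own_history / opp_history: lists of 19x19 binary planes, most recent last.
    Roughly follows the 19x19x17 input described in the AlphaGo Zero paper.
    """
    def last_k(planes, k=8):
        # Pad with empty boards if fewer than k past positions exist yet.
        return [np.zeros((19, 19), dtype=np.float32)] * (k - len(planes)) + list(planes[-k:])

    colour_plane = np.full((19, 19), 1.0 if black_to_play else 0.0, dtype=np.float32)
    planes = last_k(own_history) + last_k(opp_history) + [colour_plane]
    return np.stack(planes, axis=0)   # shape: (17, 19, 19)

# Example: the start of a game, black to play.
x = build_input_stack([], [], black_to_play=True)
print(x.shape)   # (17, 19, 19)
```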