The primary reason for including a history of states is most likely indeed the ko rule. Even if a long history is often redundant, it's unlikely to hurt either (except that it costs some computation time... but that does not appear to be a major concern in either of the papers).
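For concreteness, here is a minimal sketch (my own illustration in Python/NumPy, not anything from the papers) of why a history matters for ko: a positional-superko check is just a lookup against previously seen positions, which the current position alone cannot provide.

```python
import numpy as np

def violates_superko(candidate_position, position_history):
    """Hypothetical helper: True if the candidate board exactly repeats
    any position already seen earlier in the game (positional superko)."""
    return any(np.array_equal(candidate_position, past)
               for past in position_history)

# Toy example on a 19x19 board: re-creating an earlier position is illegal.
empty = np.zeros((19, 19), dtype=np.int8)
history = [empty.copy()]
print(violates_superko(empty, history))  # True: this position occurred before
```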
I imagine that having a history of states can also help the Neural Network to more easily "focus" on important areas of the board during training (maybe only at the beginning of the training process, maybe for longer). You're right that, from a game-theoretic point of view, the current game state should be sufficient (ignoring rare cases). **However, the learned components of AlphaGo Zero are not optimal in a game-theoretic sense early on in training** (and likely still aren't after training either). Architectural choices in the Neural Network that are redundant in a game-theoretic sense may still be beneficial for more rapid learning.
For example, in early stages of learning, when the learned components still perform poorly, I imagine a useful heuristic may be to pay more "attention" to areas of the board where moves have recently been made; these are more likely to be important than quiet areas where nothing is going on (especially for an amateur-level component at the beginning of the learning process). Such areas can easily be identified by subtracting consecutive game states in the recent history from each other, as in the sketch below.
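As a rough illustration of that last point (again my own sketch, not taken from the papers), differencing two consecutive history planes immediately exposes where the latest moves were made:

```python
import numpy as np

# Hypothetical example: history[t] is a 19x19 plane of one player's stones
# after move t (1 = stone present, 0 = empty), as in a stacked-history input.
history = [np.zeros((19, 19), dtype=np.int8) for _ in range(3)]
history[1][3, 3] = 1          # a stone appears at (3, 3) on move 1
history[2][3, 3] = 1
history[2][15, 16] = 1        # another stone appears at (15, 16) on move 2

# Subtracting consecutive planes leaves 1s only where something changed,
# i.e. where moves (or captures) just happened -- a crude "recent activity" map.
recent_activity = np.abs(history[-1] - history[-2])
print(np.argwhere(recent_activity))   # -> [[15 16]]
```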
I would not be surprised if this indeed ends up happening to some extent during the learning process. Intuitively, Stochastic Gradient Descent tends to take the path of "least resistance", greedily optimizing parameters wherever it happens to notice some correlation between inputs and outputs. Of course, this whole argument is rather handwavy, and it may not be happening at all... I suppose the main point is what I put in boldface above: Neural Network-based components are not optimal in a game-theoretic sense, and they're also trained using algorithms (SGD-style) that may take suboptimal paths to global minima, or may not find a global minimum at all. Modifications that would not be necessary for an "optimal" solution may still help to find good solutions more easily or more quickly.
David Silver, one of the first authors on both of the AlphaGo papers, describes a hypothesis similar to what I called "focus" above (more commonly described as "attention") here (emphasis mine):
> Actually, the representation would probably work well with other choices than 8 planes! But we use a stacked history of observations for three reasons: 1. it is consistent with common input representations in other domains (e.g. Atari), 2. we need some history to represent ko, 3. **it is useful to have some history to have an idea of where the opponent played recently - these can act as a kind of attention mechanism** (i.e. focus on where my opponent thinks is important).
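For reference, here is a small sketch of the kind of stacked-history input Silver is referring to. The 17-plane layout (8 planes of the current player's stones, 8 of the opponent's, plus one colour-to-play plane) follows the AlphaGo Zero paper; the helper function and its names below are just my own illustration.

```python
import numpy as np

def build_input_stack(own_history, opp_history, black_to_play):
    """Stack the last 8 board planes for each player plus a colour plane.

    own_history / opp_history: lists of 19x19 binary planes, most recent last.
    Roughly follows the 19x19x17 input described in the AlphaGo Zero paper.
    """
    def last_k(planes, k=8):
        # Pad with empty boards if fewer than k past positions exist yet.
        return [np.zeros((19, 19), dtype=np.float32)] * (k - len(planes)) + list(planes[-k:])

    colour_plane = np.full((19, 19), 1.0 if black_to_play else 0.0, dtype=np.float32)
    planes = last_k(own_history) + last_k(opp_history) + [colour_plane]
    return np.stack(planes, axis=0)   # shape: (17, 19, 19)

# Example: the start of a game, black to play.
x = build_input_stack([], [], black_to_play=True)
print(x.shape)   # (17, 19, 19)
```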