The game of Tic Tac Toe is very simple, and there are many ways to write a perfect AI for it. You can hard-code a strategy for each possible type of board (thanks to symmetries, the number of distinct boards is quite limited), or you can brute-force all possible outcomes from a given board. As an exercise in reinforcement learning, which we talked about in a previous blog post, we wrote a machine learning algorithm that teaches itself Tic Tac Toe.
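For reference, the brute-force approach can be sketched in a few lines of plain minimax. This is a hypothetical illustration; the board representation and function names are our own, not code from the project:

```python
# Brute-force Tic Tac Toe: recursively explore every continuation.
# The board is a tuple of 9 cells, each 'X', 'O', or ' '.

LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),   # rows
         (0, 3, 6), (1, 4, 7), (2, 5, 8),   # columns
         (0, 4, 8), (2, 4, 6)]              # diagonals

def winner(board):
    for a, b, c in LINES:
        if board[a] != ' ' and board[a] == board[b] == board[c]:
            return board[a]
    return None

def minimax(board, player):
    """Best achievable outcome for 'X': +1 win, 0 tie, -1 loss."""
    w = winner(board)
    if w:
        return 1 if w == 'X' else -1
    if ' ' not in board:
        return 0  # board full, tie
    scores = []
    for i, cell in enumerate(board):
        if cell == ' ':
            nxt = board[:i] + (player,) + board[i + 1:]
            scores.append(minimax(nxt, 'O' if player == 'X' else 'X'))
    return max(scores) if player == 'X' else min(scores)

# With perfect play from both sides, Tic Tac Toe is always a tie:
print(minimax((' ',) * 9, 'X'))  # 0
```

Exhaustive search like this is only feasible because the game tree is tiny; it is the baseline our learning approach is measured against.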
To test it out, we started by letting the AI play 5,000 games against itself. The playing policy for this learning phase was trivial: both sides made completely random moves. We called one side AI and the other Player to tell them apart. In each of the 5,000 games, Player had the first move. The results were devastating for our AI: it lost 58.84% of all games and tied 12.74%.
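The first-mover advantage is easy to reproduce. The sketch below assumes nothing about the project's actual code; it simply pits two uniformly random players against each other, with 'X' always moving first:

```python
import random

LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]

def winner(board):
    for a, b, c in LINES:
        if board[a] != ' ' and board[a] == board[b] == board[c]:
            return board[a]
    return None

def random_game(rng):
    """Play one game with both sides uniformly random; 'X' moves first."""
    board = [' '] * 9
    player = 'X'
    while True:
        free = [i for i, cell in enumerate(board) if cell == ' ']
        board[rng.choice(free)] = player
        w = winner(board)
        if w:
            return w
        if ' ' not in board:
            return 'tie'
        player = 'O' if player == 'X' else 'X'

rng = random.Random(0)  # arbitrary seed for reproducibility
results = [random_game(rng) for _ in range(5000)]
for outcome in ('X', 'O', 'tie'):
    print(outcome, results.count(outcome) / len(results))
```

The exact percentages vary with the seed, but the first mover reliably wins roughly twice as often as the second, matching the numbers above.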
The reason for this is simple: when both sides play randomly, the side that begins has a huge advantage. It turns out that this advantage is about 2:1; Player won twice as often as AI. But unlike Player, AI took notes. During the learning phase it kept track of the various game states and figured out which ones are desirable and may lead to wins. After the learning phase was over, we allowed the AI to use its newfound knowledge. In the following 5,000 games we let it pick the move it liked the most while Player (still machine-controlled) continued to play randomly. As before, Player always had the first move.
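One common way to implement this kind of note-taking is a tabular value estimate: every state visited during a game is credited with the game's final reward, averaged over all games in which that state occurred, and "pick the move it likes the most" then becomes a simple argmax. The post does not show its implementation, so this is a hypothetical sketch with made-up state names:

```python
from collections import defaultdict

values = defaultdict(float)  # state -> running mean of final rewards
counts = defaultdict(int)    # state -> number of games it appeared in

def update(states_seen, final_reward):
    """Credit every state on the game's path with the final reward."""
    for state in states_seen:
        counts[state] += 1
        # incremental mean: nudge the old mean toward the new sample
        values[state] += (final_reward - values[state]) / counts[state]

def greedy_move(successor_states):
    """After learning, choose the successor state with the highest value."""
    return max(successor_states, key=lambda s: values[s])

update(['A', 'B'], 1.0)   # a game through states A and B ended in a win
update(['A'], -1.0)       # another game through A ended in a loss
print(values['A'], values['B'])  # A averages out to 0.0, B stays at 1.0
print(greedy_move(['A', 'B']))   # -> 'B'
```

In the real system the dictionary keys would be board positions rather than letters, but the bookkeeping is the same.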
The final stats for this round looked considerably different. AI now won 83.24% of all games and tied 6.80%. Now you may ask yourself why the AI lost some games against a player who makes random moves. Should it not always at least tie? The answer to this is quite interesting: based on what the AI experienced during training, it concluded (correctly) that its opponent plays randomly.
Imagine that there are six free squares. Two Os are placed such that a third O could win the game for Player, and one X is placed. Since the AI assumes that its opponent plays randomly, there is a good chance that Player's next move won't complete the three Os. The AI may therefore ignore this risk and prefer to put down an X that gives it the opportunity to win on its very next move (whenever possible it combines both goals: blocking the three Os while also setting itself up for a possible win).
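A back-of-the-envelope calculation shows why ignoring the threat can look attractive against a random opponent. The probabilities below are deliberately simplified (they ignore everything beyond the next two moves), and the rewards follow the 2:1 win-to-loss ratio described in the next paragraph:

```python
# Illustrative expected values for the scenario above: the AI is to move,
# the opponent has two in a row, and five squares remain after the AI moves.
WIN, LOSS, CONTINUE = 2.0, -1.0, 0.0  # rewards in the post's 2:1 ratio

# Option A: ignore the threat and create a winning threat of our own.
#   1/5: the random opponent completes its row -> we lose
#   1/5: the random opponent blocks our row    -> game goes on
#   3/5: neither happens                       -> we win on our next move
ev_attack = (1 / 5) * LOSS + (1 / 5) * CONTINUE + (3 / 5) * WIN

# Option B: block the opponent's row; no immediate win or loss either way.
ev_block = CONTINUE

print(ev_attack, ev_block)  # attacking looks far better in expectation
```

Against a perfect opponent Option A would be suicidal, but against a random one the numbers favor the gamble.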
The AI always weighs the risk of losing against the reward of winning, and depending on how aggressive we make it, it may care more about winning than about not losing. In this particular case we had set the positive reward for winning twice as high as the negative reward for losing, and we also specified a negative reward for ties. Because of this reward structure, our AI really, really wants to win rather than tie, so it is willing to take some risks and thus sometimes loses.
By massaging the reward structure, we can control the play style of the AI. If we want it to lose less, we can make the negative reward for losing much more negative. As a result, it will start to tie more, win less and lose less. While Masters is considerably more complex than Tic Tac Toe, the underlying decision process for an AI is similar. In both cases the machine must figure out an ideal action given a specific situation. In the end, hopefully this behavior resembles that of a human.
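To see how the loss penalty shifts behavior, we can rerun the expected-value comparison from the earlier threat scenario under two hypothetical reward settings; only the 2:1 ratio of the first setting comes from our actual configuration, the rest of the numbers are illustrative:

```python
# Expected value of the risky move (ignore the opponent's open row and set up
# our own win) against a random opponent with five empty squares left:
#   1/5 we lose, 1/5 the opponent blocks us (value ~0), 3/5 we win next move.
# Blocking instead has expected value ~0. Probabilities are simplified.
def ev_risky(win, loss):
    return (1 / 5) * loss + (3 / 5) * win

aggressive = ev_risky(win=2.0, loss=-1.0)  # positive: the gamble pays off
cautious = ev_risky(win=2.0, loss=-8.0)    # negative: blocking is now better
print(aggressive, cautious)
```

With the harsher loss penalty the risky move drops below the safe block, so the same learning machinery produces a more defensive player that ties more often.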