
Edited by: Balaraman Ravindran, Indian Institute of Technology Madras, India

Reviewed by: Michelangelo Ceci, University of Bari Aldo Moro, Italy; Ameet Soni, Swarthmore College, United States; Manan Tomar, Facebook, United States, in collaboration With Reviewer Ameet Soni

This article was submitted to Machine Learning and Artificial Intelligence, a section of the journal Frontiers in Artificial Intelligence

This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

Deep neural networks have been successfully applied in learning the board games Go, chess, and shogi without prior knowledge by making use of reinforcement learning. Although starting from zero knowledge has been shown to yield impressive results, it is associated with high computational costs, especially for complex games. With this paper, we present CrazyAra, a crazyhouse chess program based on a deep neural network in combination with MCTS.

The project

We approach the problem by training a deep neural network from human game data, in a similar fashion as

First, we introduce a more compact input board representation by making the state fully Markovian and removing the history component.

Second, we highlight the importance of input preprocessing, in the form of rescaling or normalization, for significantly better performance.

Third, we present a new, more powerful, and more efficient neural network architecture based on advancements in computer vision such as grouped depthwise convolutions, pre-activation ResNets, and squeeze-excitation layers.

Fourth, we investigate several techniques to make the Monte Carlo tree search (MCTS) more sample efficient. This includes the usage of

Finally, we evaluate the strength of the neural network in combination with MCTS against expert human players as well as the most common crazyhouse chess engines.

We proceed as follows. We start off, in section 2, by briefly reviewing prior work in computer crazyhouse and machine learning in games. Section 3 then goes through the general scheme on how the neural network is trained and integrated with MCTS to be used as an engine. Our input representation for encoding the board state is introduced in section 4, and our output representation in section 5. Section 6 goes over the

Crazyhouse is a fairly recent chess variant, which primarily enjoys popularity in on-line chess servers such as ^{1}^{2}^{3}^{2}

Generally, machine learning in computer chess (Skiena,

Monte-Carlo Tree Search (MCTS; Kocsis and Szepesvári,

Recently, ^{4}^{5}^{6}

Our crazyhouse bot

Compilation pipeline of the

_{θ}(

Actually, the neural network is a shared network, which predicts both the value and the policy (section 6) for a given board state in a single forward pass. The loss function is defined as follows:

l = α(z − v)^2 − π^⊤ log p + c‖θ‖_2^2,   (1)

where z and v are the true and predicted game outcome, π and p are the target and predicted policy distribution, α is a weighting parameter for the value loss, and c is the L_2 regularization constant for the network weights θ. We set α to be 0.01 in order to avoid overfitting to the training values as suggested by Silver et al.
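As a minimal sketch of this combined loss (assuming NumPy arrays, a strictly positive policy prediction such as a softmax output, and illustrative parameter values; `combined_loss` is our name, not the paper's):

```python
import numpy as np

def combined_loss(z, v, pi, p, theta, alpha=0.01, c=1e-4):
    """Shared-network loss: weighted value MSE, policy cross-entropy,
    and an L2 penalty on the network weights theta."""
    value_loss = alpha * (z - v) ** 2        # z, v are scalars in [-1, 1]
    policy_loss = -np.dot(pi, np.log(p))     # pi: target policy, p: prediction (> 0)
    l2_penalty = c * np.sum(theta ** 2)
    return value_loss + policy_loss + l2_penalty
```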

The trained (deep) neural network f_θ guides the search at each step. For each state s_t, the following UCT-like formula (Equation 2) is used for selecting the next action a_t, leading to a new state s_{t+1}.

This move selection continues until a new, previously unexplored state s^* or a terminal node s_T is reached. The state s^* is then expanded, the corresponding position is evaluated using the learned neural network, and the received value evaluation is propagated through all nodes on the path from the current position s^* back to the root position s_0. In case of a terminal node s_T, the static evaluation according to the game rules, being either −1, 0, or +1, is used instead. After enough samples have been taken, the most promising move, i.e., the node which has been visited most often, is played on the board (section 9).
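The select–expand–evaluate–backpropagate cycle described above can be sketched as follows; `evaluate` stands in for the neural network f_θ, the class and function names are ours, and the fixed exploration constant is a simplification:

```python
import math

class Node:
    def __init__(self, prior):
        self.prior = prior      # P(s, a) assigned by the policy network
        self.visits = 0         # N(s, a)
        self.value_sum = 0.0    # accumulated value W(s, a)
        self.children = {}      # action -> Node

    def q(self):
        return self.value_sum / self.visits if self.visits else 0.0

def select_child(node, c_puct=2.5):
    # UCT-like criterion: Q(s, a) + c_puct * P(s, a) * sqrt(N(s)) / (1 + N(s, a))
    total = sum(child.visits for child in node.children.values())
    return max(node.children.items(),
               key=lambda kv: kv[1].q()
               + c_puct * kv[1].prior * math.sqrt(total) / (1 + kv[1].visits))

def simulate(root, evaluate):
    """One MCTS iteration: select down to an unexplored leaf, expand it with
    the network prediction, and propagate the value back to the root."""
    path, node = [root], root
    while node.children:                  # selection
        _, node = select_child(node)
        path.append(node)
    priors, value = evaluate(node)        # expansion + evaluation
    for action, prior in priors.items():
        node.children[action] = Node(prior)
    for visited in reversed(path):        # backpropagation, sign flips per ply
        visited.visits += 1
        visited.value_sum += value
        value = -value
```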

^{7}^{8}^{9}^{10}

Let us now dive into the details of each component of

The input to

Plane representation for crazyhouse.

P1 piece | 6 | Bool | Order: { |

P2 piece | 6 | Bool | Order: { |

Repetitions* | 2 | Bool | Indicates how often the board positions has occurred |

P1 pocket count* | 5 | Int | Order: { |

P2 pocket count* | 5 | Int | Order: { |

P1 Promoted Pawns | 1 | Bool | Indicates pieces which have been promoted |

P2 Promoted Pawns | 1 | Bool | Indicates pieces which have been promoted |

En-passant square | 1 | Bool | Indicates the square where en-passant capture is possible |

Color* | 1 | Bool | All zeros for black and all ones for white |

Total move count* | 1 | Int | Sets the full move count (FEN notation) |

P1 castling* | 2 | Bool | Binary plane, order: { |

P2 castling* | 2 | Bool | Binary plane, order: { |

No-progress count* | 1 | Int | Sets the no progress counter (FEN halfmove clock) |

To make the board representation fully Markovian, we add an additional feature channel, which highlights the square of a possible en-passant capture. In contrast to standard chess, piece promotions are a lot more common and often occur multiple times in a game. If a promoted piece gets captured, it is demoted to a pawn again and added to the pocket of the other player. To encode this behavior, we highlight each square of a piece that has been promoted, using a binary map for each player. Overall, the input representation consists of 34 feature channels of size 8 × 8.^{11}

In over-the-board (OTB) games, players usually see their own pieces on the near side of the board. We make use of symmetries by flipping the board representation on the second player's turn, so that the starting square of the queen is always to the left of the king for both players.

All input planes are scaled to the linear range [0, 1]. For most computer vision tasks, each channel of an image is encoded as 8 bit per pixel, resulting in discrete values in [0, 255]. Commonly, these input images are divided by 255 to bring each pixel into the [0, 1] range. For ImageNet (Russakovsky et al., ^{12}
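A sketch of this pre-processing step follows; the maximum values for the pocket-count and move-count channels are assumptions for illustration, with only the no-progress maximum of 50 motivated by the fifty-move rule:

```python
import numpy as np

# Hypothetical per-channel maxima for the non-binary planes; binary planes
# already lie in [0, 1] and are left untouched.
PLANE_MAX = {"pocket_count": 8.0, "total_move_count": 500.0,
             "no_progress_count": 50.0}

def normalize_plane(plane, max_value):
    """Rescale an integer-valued input plane to the linear range [0, 1]."""
    return plane.astype(np.float32) / max_value
```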

To illustrate the benefits of including this step, we conducted the following experiments, learning curves shown in _{1} = 0.9, β_{2} = 0.999, ϵ = 10−8 and Stochastic Gradient Descent with Neterov's Momentum (NAG; Botev et al., ^{−4} and a batch size of 1024, one can make the following observations. Both optimizers highly benefit in terms of convergence speed and final convergence when using our input pre-processing step. ADAM gains +2, 3% whereas NAG gains +9, 2% move prediction accuracy. The ADAM optimizer is much more robust when dealing with the unnormalized feature set due to its internal automatic feature rescaling, but is outperformed in terms of generalization ability when using NAG with our defined learning and momentum schedule. This agrees with research on different optimizer (Keskar and Socher,

Essentially, we treat learning to play crazyhouse as a modified computer vision problem, where the neural network f_θ conducts a classification for the policy prediction combined with a regression task for the value evaluation.

Plane representation of an exemplary game position of our test data set, FEN:

Next, the activation maps of a fully trained network (section 6) for a single prediction are shown in

Activation maps of model

Comparison of the activation map shapes for different architecture models. All models process the same input representation via a sequence of residual blocks followed by a value and policy head.

The final spatial activation maps of the value head are summed along the first axis and overlaid on top of the given board position in

Based on the games of

The value output, which represents the winning chances for a game position, is encoded as a single floating-point value in the range [−1, +1], as described in section 3.1.

For the policy output, we investigate two conceptually different representations. In the first, the policy head consists of a single convolutional layer with a variable number of feature channels, followed by a fully connected layer storing all 2272 theoretically plausible moves as described in UCI notation. Each entry in this vector stands for a particular predefined move, and queen-promotion moves are treated as individual moves, denoted with a suffix.^{13}

In the second version, the policy head directly encodes each of the 2272 moves on a particular square using 81 channels of 8 × 8 planes: first come the queen moves in all compass directions, followed by all knight moves for each square, promotion moves, and finally all dropping moves. Queen moves also include all pawn, bishop, rook, and king moves. We spend three additional channels on queen-promotion moves, to be consistent with the UCI move representation.^{14}

In the UCI move representation, en-passant moves and castling moves do not have separate vector indices, and king-side and queen-side castling are denoted as the move
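The first, flat policy representation can be sketched as follows; the four moves shown are an illustrative subset, not the real 2272-entry ordering, and the function names are ours:

```python
# A fixed list of moves in UCI notation defines the policy vector; drops are
# written with '@' and queen promotions carry a trailing suffix.
MOVE_LABELS = ["e2e4", "g1f3", "P@f7", "e7e8q"]   # illustrative subset only

MOVE_TO_INDEX = {move: i for i, move in enumerate(MOVE_LABELS)}

def policy_to_move(policy_vector, legal_moves):
    """Pick the legal move with the highest predicted probability."""
    return max(legal_moves, key=lambda m: policy_vector[MOVE_TO_INDEX[m]])
```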

Finding a (deep) neural network for playing board games involves three criteria, which determine the playing strength when using the learnt model in MCTS. First, the performance, i.e., the policy and value loss on the test data set, is the main component for predicting the best moves at each roll-out. Second, a better performance is often associated with a longer training and inference time, leading to fewer evaluations per second during search. Third, the memory consumption per prediction specifies the maximum achievable batch size.

To address these points,

Residual connections (He et al.,

We train several different architectures on the same training set with the same optimizer settings^{15}

Specifically, model

First, we tried training the original

256 ×8 ×8 | conv 3 ×3, 256 | ||||

256 ×8 ×8 | |||||

1 | 2272/ |
see |

It also uses Squeeze Excitation Layers (SE; Hu et al.,

Model

Dong et al. (

Furthermore, motivated by the empirical findings^{16}^{17}^{18}

Our proposed model^{19}

Now that the architectures are in place, let us turn toward the data used for training. As training data we mainly used 569537 human games^{20}^{21}

In crazyhouse, the opening advantage for the first player is even more pronounced and the chance to draw is significantly reduced compared to chess. We used all games ending in checkmate, resignation, a draw or time forfeit except aborted games. All games and moves in our data set are equally weighted in the loss function (1).

The average Elo rating for a single game is 2279.61 and very short time controls are the most popular.

One-minute games make up 45.15% of all games. As a consequence, the chance of blundering is increased and the quality of moves usually declines over the course of a game. Some games are also won on time in a lost position, in which one player gives as many checks as possible.

Additionally, we also generated a data set based on 121571 ^{22}

We trained the resulting models for seven epochs, using a one-cycle learning rate schedule combined with a momentum schedule (Smith and Topin, ^{−4} for regularization. The first iterations were treated as a warm-up period, followed by a long, linear cool-down period. The maximum and minimum learning rate (
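A one-cycle schedule of this kind can be sketched as below; the peak learning rate, the warm-up fraction, and the function name are illustrative assumptions, not the paper's exact settings:

```python
def one_cycle(step, total_steps, lr_min=1e-5, lr_max=0.35, warmup=0.25):
    """Linear warm-up to lr_max, then a long linear cool-down back to lr_min."""
    peak = int(total_steps * warmup)
    if step < peak:
        return lr_min + (lr_max - lr_min) * step / peak
    return lr_max - (lr_max - lr_min) * (step - peak) / (total_steps - peak)
```

The momentum schedule typically runs inversely to this curve: momentum is lowered while the learning rate rises and raised again during the cool-down.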

Schedules used for modifying the parameters of Nesterov's stochastic gradient descent optimizer.

As a new metric, we define the “

Learning progress of training for seven epochs on the

Detailed view on the last iterations of training for different model architectures.

Furthermore, the value loss decreased over the course of a game. After training, we evaluated our models using a mate in one data set (

Performance metrics for different models on the

Combined loss | 1.2138 | 1.2166 | 1.1964 | 1.1925 | |

Policy loss | 1.2184 | 1.2212 | 1.2008 | 1.1968 | |

Value loss | 0.7596 | 0.7601 | 0.7619 | 0.7663 | |

Policy accuracy | 0.5986 | 0.5965 | 0.6023 | 0.6032 | |

Value accuracy sign | 0.6894 | 0.6888 | 0.6889 | 0.6868 | |

Mate-in-one-accuracy | 0.954 | 0.942 | 0.945 | ||

Mate-in-top-5-accuracy | 0.998 | 0.998 | |||

Mate-in-one-value-loss | 0.0532 | 0.0560 | 0.0567 | 0.0624 |

Model

For training on the ^{22}

Only relying on the initial prior policy _{θ} for playing will remain unsatisfactory in a complex environment such as crazyhouse. Therefore, the policy is now improved using Monte Carlo Tree Search (MCTS). An overview of the MCTS hyperparameters and their respective values is available in

For the MCTS, we used the most recent version of the PUCT algorithm (Silver et al.,

Specifically, a node is selected at each step t by a_t = argmax_a (Q(s_t, a) + U(s_t, a)), where the exploration term U(s_t, a) = c_puct(s_t) · P(s_t, a) · √(Σ_b N(s_t, b)) / (1 + N(s_t, a)) is computed for every available action a. We use c_puct-init = 2.5 as our initial exploration constant and let it grow slowly with the number of node visits,

c_puct(s) = log((Σ_b N(s, b) + c_puct-base + 1) / c_puct-base) + c_puct-init,

with a c_puct-base of 19652. We apply Dirichlet noise with a factor of 25% to the probability distribution P(s_0, a) at the root node.
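A sketch of the exploration term under these settings (the function names are ours):

```python
import math

def c_puct(total_visits, c_init=2.5, c_base=19652):
    """Exploration rate that grows slowly with the parent's visit count."""
    return math.log((total_visits + c_base + 1) / c_base) + c_init

def u_term(prior, total_visits, child_visits):
    """U(s, a) = c_puct(s) * P(s, a) * sqrt(sum_b N(s, b)) / (1 + N(s, a))."""
    return (c_puct(total_visits) * prior
            * math.sqrt(total_visits) / (1 + child_visits))
```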

However, to really master the game of crazyhouse at the level of a world champion, we also had to modify standard MCTS in several ways, which we now discuss.

For selecting the final move after search, the vanilla version uses an equation based only on the number of visits N(s_0, a) for each direct child node,

π(a | s_0) = N(s_0, a)^{1/τ} / Σ_b N(s_0, b)^{1/τ},

where τ is a temperature parameter which controls the exploration. In tournament play against other engines, τ is set to 0, which results in choosing the move with the highest number of visits.
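The temperature-controlled final move selection can be sketched as:

```python
import numpy as np

def move_distribution(visits, tau=1.0):
    """pi(a | s_0) proportional to N(s_0, a)^(1/tau); tau = 0 collapses the
    distribution onto the most visited move (tournament play)."""
    if tau == 0:
        probs = np.zeros(len(visits))
        probs[np.argmax(visits)] = 1.0
        return probs
    scaled = np.asarray(visits, dtype=np.float64) ** (1.0 / tau)
    return scaled / scaled.sum()
```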

We investigated taking information about the

Following Silver et al., Q-values of rarely visited nodes are treated as unreliable: all actions with a visit count below Q_thresh · max_a N(s_0, a) are excluded. We set Q_thresh to 0.33 and denote these updated Q-values of the root node s_0 as Q̃(s_0, a).

The

The

where _{t} or a terminal node _{T} has been reached. As can be seen in ^{23}

Self Elo increase with respect to nodes.

Match results of different MCTS move selection types playing each setting against 800 simulations per move using only the most visited node.

Default (visits only) | 800 | − | − | − | 0 |

Default (visits only) | 1600 | 80 | 20 | 0 | 240.82 ± 88.88 |

Default (visits only) | 2400 | 93 | 6 | 1 | |

Visits & Q-values | 800 | 51 | 48 | 1 | 10.43 ± 68.57

Visits & Q-values | 1600 | 87 | 12 | 1 |

Visits & Q-values | 2400 | 91 | 8 | 0 | 422.38 ± 148.92

Visits & q-pv-values | 800 | 57 | 41 | 2 | |

Visits & q-pv-values | 1600 | 87 | 13 | 0 | 330.23 ± 110.06 |

Visits & q-pv-values | 2400 | 90 | 10 | 0 | 381.70 ± 128.31 |

For move selection which also makes use of Q-values, we scale Q_thresh with the number of visits, similar to Equation (3), and keep Q_factor fixed at 0.7:

where Q_{thresh-init} is 0.5, Q_{thresh-max} is 0.9, and Q_{thresh-base} is 1965.

To achieve better comparability with other crazyhouse engines, we convert our final

where λ is 1.2.

We also integrate a basic time management system, which does not search moves based on a fixed number of nodes, but on a variable move time t_current. This makes the engine applicable on different hardware and better suited for playing other engines. There are two main time settings for engine games. In the basic one, a predefined constant time is assigned for a given number of moves (e.g., 40). In sudden-death mode, a total game time is given to both players for all moves. Optionally, the time can be increased by a certain amount on each move, also called increment, in order to reduce the risk of losing on time in long games.

Our time management in principle uses a fixed move time, which depends on the expected game length, remaining time, and increment. We allocate a constant move time and add 70% of the increment time per move. For sudden-death games, we assume a game length of 50 moves and switch to a proportion-based system at move 40. In the proportional system, we allocate 5% of the available time per move and therefore assume that the game will last 20 moves longer. This formula models the empirical distribution of the expected number of moves to the end of the game, as presented by Vučković and Šolak (
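The scheme above can be sketched as follows; the constant-time branch before move 40 is simplified here by spreading the remaining time over the expected game length, and the function name is ours:

```python
def allocate_move_time(remaining_time, increment, move_number,
                       expected_game_length=50, switch_move=40,
                       proportion=0.05, increment_factor=0.7):
    """Time budget per move: a fixed share up to `switch_move`, then 5% of
    the remaining time, plus 70% of the increment in both regimes."""
    if move_number < switch_move:
        moves_left = max(expected_game_length - move_number, 1)
        base = remaining_time / moves_left
    else:
        base = remaining_time * proportion
    return base + increment_factor * increment
```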

Moreover, we stop the search immediately if there is only a single legal move available (e.g., the king is in check with only one escape square) or prematurely at half the allocated time for

Last, for games with human players, we also adjust the allocated time per move

where _{factor}~[−0.1, +0.1] to increase playing variety.

Furthermore, we introduce a transposition table, which stores a pointer to all unique nodes in the tree. Since the number of nodes is several magnitudes lower than for alpha-beta engines, its memory size is negligible. Transposition tables are used in most modern chess engines as a look-up table to store and reuse the value evaluation for already explored nodes. In the case of MCTS, we can reuse already existing policy prediction vectors as well as value evaluations. Additionally, one might copy the move generation instead of recomputing it. Transpositions occur frequently during search in chess games and can increase the evaluated nodes per second by a factor of two or more on certain positions. Because our input representation depends on the movement counter as well as the no-progress counter, we only take transpositions into account where these counters share the same value. A check of whether the node is in the set of its parent nodes could even result in a better performance.
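A minimal sketch of such a table, keyed on a position hash together with the two counters the input representation depends on (class and method names are ours):

```python
class TranspositionTable:
    """Caches the policy vector and value evaluation of explored nodes."""
    def __init__(self):
        self._table = {}

    def key(self, board_hash, move_count, no_progress_count):
        # The counters are part of the key because the input planes encode them.
        return (board_hash, move_count, no_progress_count)

    def store(self, key, policy, value):
        self._table[key] = (policy, value)

    def lookup(self, key):
        return self._table.get(key)   # None if the position is unexplored
```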

We also notice that for certain positions, if a node was found with a

This increases the chance of exploring unvisited nodes at least once, according to the principle

“

A u_divisor < 1 increases the need of exploring unvisited nodes and can help to reduce the chance of missing key moves, but comes at the cost of losing search depth. To avoid over-exploration at nodes with a low visit count, we reduce u_divisor over time, similar to Equation (11):

where u_min is 0.25, u_init is 1, and u_base is 1965.
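One plausible reading of this decay, assuming an exponential decay from u_init toward u_min (the function name is ours):

```python
import math

def u_divisor(parent_visits, u_min=0.25, u_init=1.0, u_base=1965):
    """Decays from u_init toward u_min as the parent's visit count grows,
    limiting over-exploration once a node is well sampled."""
    return u_min + (u_init - u_min) * math.exp(-parent_visits / u_base)
```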

Checks are usually important moves in crazyhouse, and it can have detrimental effects if they are missed during search. To ensure that checks have been sufficiently explored, we add the option to enhance the prior probability of all checking moves whose probability falls below a given threshold by

where we set the check threshold to 0.1 and the check factor to 0.5, and renormalize the distribution afterwards. This modification has the following motivations: the preassigned order of checking moves should be preserved, but checking moves with a low probability are preferred over low-confidence non-checking moves. The top-candidate non-checking move should remain the move with the highest prior probability.
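One plausible implementation of this enhancement (the additive boost and the name `enhance_checks` are our assumptions, since the exact update rule is paraphrased above):

```python
import numpy as np

def enhance_checks(priors, checking_moves, thresh=0.1, factor=0.5):
    """Raise the prior of every checking move below `thresh` by
    factor * thresh, then renormalize to a probability distribution."""
    p = np.asarray(priors, dtype=np.float64).copy()
    for move_idx in checking_moves:
        if p[move_idx] < thresh:
            p[move_idx] += factor * thresh
    return p / p.sum()
```

Because the boost is additive and uniform, the ordering among low-probability checking moves is preserved, and a clear top non-checking candidate keeps the highest prior.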

Including this step might not be beneficial as soon as our network f_θ reaches a certain level of play, but it provides guarantees and was found to greatly increase the efficiency in positions where a forced mate is possible or in which the opponent is threatening a forced mating sequence. Further, we disable any exploration for a particular node as soon as a move was found which leads to a winning terminal node. A direct checkmate is known to be the best of all available moves, so additional exploration is unneeded and can only distort the value evaluation.

Before moving on to our empirical evaluation, let us discuss the pros and cons of the techniques used in

As mentioned, alpha-beta engines are strongest at open tactical positions. This holds particularly true for finding forced sequences such as a long series of checks. For example, it is common for

In contrast, MCTS shows the opposite behavior and shares similar strengths and weaknesses with human play. It exhibits a significantly lower number of node evaluations and is generally inferior in solving tactical positions quickly if the tactic does not follow its current pattern recognition. On the other hand, it is often better at executing long-term strategies and sacrifices, because its search is guided by a non-linear policy and is able to explore promising paths more deeply. Alpha-beta engines commonly rely purely on handcrafted linear value evaluations, and there is no flow of information along a proposed principal variation. The non-linear, more costly value evaluation function also allows MCTS to distinguish between small nuances in similar positions. Additionally, (deep) neural networks are able to implicitly build an opening book based on supervised training data or self-play, whereas traditional alpha-beta engines need to search a repeated position from scratch or have to store evaluations in a look-up table which grows linearly in size.

In crazyhouse the importance of tactics is increased compared to classical chess and generally when a certain tactic has been missed, the game is essentially decided in engine vs. engine games. Stronger strategic skills result in long grinding games, building up a small advantage move by move, and usually take longer to execute.

MCTS is naturally parallelizable and can make use of every single node evaluation, while minimax search needs to explore a full depth to update its evaluation and is harder to parallelize. Also, the computational effort increases exponentially with each depth in minimax search. Besides that, MCTS returns a distribution over all legal moves, which can be used for sampling as in reinforcement learning. As a downside, this version of MCTS highly depends on the probability distribution of f_θ, which can have certain blind spots of missing critical moves. Minimax search with alpha-beta pruning is known to suffer from the horizon effect, where a misleading value evaluation is given because certain tactics have not been resolved. To counteract this, minimax-based algorithms employ quiescence search to explore certain moves at greater depth (Kaindl,

Search tree statistics over time for the Move ^{25}

There are 73 moves available and ^{rd} candidate move.

The convergence for the node visits behaves linearly as soon as the move

When comparing our changes to MCTS, we notice the following: in the

Our intention here is to evaluate the playing strength of

Over the course of its development, ^{24}

We also evaluate the playing strength of

All engines including ^{25}^{26}

To get a reference point, DeepMind's

We played 10 matches with each engine starting from five common crazyhouse opening positions. Each position was played twice, one in which

The results of the matches (see ^{27}

Match results of

CrazyAra | 0.6.0 | – | 0.00033 | 62 ± 18 | 74 ± 24 | 118 | 0 | 12 |

PyChess | 1.1 | > 1566.25^{*} | 0.012 | – | 41 ± 11 | 0 | 0 | 10

KKFChess^{†} | 2.6.7b | 1849.50 | 2.9 | – | 57 ± 10 | 0 | 0 | 10

TSCP ZH | 1.1 | 1888.47 | 0.5 | – | 54 ± 12 | 0 | 0 | 10 |

Pulsar | 2009-9b | 1982.07 | ? | – | 61 ± 13 | 0 | 0 | 10 |

Feuerstein^{†} | 0.4.6.1 | 2205.74 | 0.1 | – | 57 ± 8 | 0 | 0 | 10

Nebiyu^{†} | 1.45a | 2244.39 | 1.5 | – | 42 ± 8 | 0 | 0 | 10

SjaakII | 1.4.1 | 2245.56 | 0.425 | – | 61 ± 10 | 0 | 0 | 10 |

Sjeng | 11.2 | 2300.00 | 0.7 | – | 66 ± 15 | 0 | 0 | 10 |

CrazyWa | 1.0.1 | 2500.00 | 1.4 | – | 73 ± 10 | 0 | 0 | 10 |

Sunsetter | 9 | 2703.39 | 1.5 | – | 74 ± 19 | 0 | 0 | 10 |

TjChess | 1.37 | 2732.58 | 1.37 | 53 ± 0 | 85 ± 17 | 1 | 0 | 9 |

Imortal^{†} | 4.3 | > 2997.33^{*} | 0.9 | 134 ± 0 | 77 ± 9 | 1 | 0 | 9

Stockfish | 10 (2018-11-29) | > 3946.06^{*} | 4.6 | 70 ± 16 | – | 10 | 0 | 0

The games between ^{28}

Match results of

CrazyAraFish | 0.6.0 | – | 0.0014 | 118 ± 22 | 98 ± 34 | 3 | 1 | 6 |

Stockfish | 10 (2018-11-29) | > 3,946.06 | 6.7 | 98 ± 34 | 118 ± 22 | 6 | 1 | 3 |

In this work, we have developed a crazyhouse chess program, called CrazyAra.

All games of the presented results in this article, as well as supplementary figures and tables, are included in the article/Supplementary Material.

The project was initially started by JC, MW, and AB as part of the course Deep Learning: Architectures & Methods held by KK and JF in summer 2018. This paper was mainly written by JC with the continuous support and revision of KK and JF.

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

The authors thank the main

The Supplementary Material for this article can be found online at:

^{1}

^{2}

^{3}

^{4}

^{5}

^{6}See e.g.,

^{7}

^{8}

^{9}

^{10}

^{11}More information on the FEN notation can, e.g., be found at:

^{12}According to the 50 move rule a value of 50 is recommended instead.

^{13}

^{14}Details can be found in

^{15}

^{16}

^{17}

^{18}

^{19}Inference time comparison can be found in

^{20}

^{21}Statistics about the data set can be found in

^{22}The model used was based on our RISEv1 architecture:

^{23}Matches are available in

^{24}All games can be found in the

^{25}The FEN for this position is:

^{26}

^{27}

^{28}We also tried choosing a higher hash size for