Jia-peng Dai
2022/11
AlphaGo (2016)
AlphaGo Zero (2017)
AlphaZero (2018)
MuZero (2020)
Go is a zero-sum game played turn by turn on a 19x19 board, so the search for the best move can naturally be described as a tree search.
There are theoretically 361! possible move sequences in a single game, so the key problem is reducing the search complexity (e.g. with Alpha-Beta pruning or MCTS).
Two-stage training pipeline
Train two policy networks: the rollout policy and the SL policy network.
Both learn from human expert positions to predict the distribution of human expert moves (classification).
Supervised learning process
Using Policy Iteration:
Policy improvement (Policy network);
Policy evaluation (Value network).
Reinforcement learning process
Initialized from the SL policy network; the output is a distribution over actions given the board state.
Self-play: play games between the current policy network and a randomly selected previous iteration of the policy network. This stabilizes training by preventing overfitting to the current policy.
Every 500 iterations, add the current parameters ρ to the opponent pool.
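A minimal sketch of this opponent-pool bookkeeping; the `train_step` callable and the parameter objects are placeholders, not AlphaGo's actual training code:

```python
import random

def self_play_rl(initial_params, train_step, num_iterations=10000):
    """Hypothetical sketch: keep a pool of past policies and pit the
    current policy against a randomly sampled member of the pool."""
    opponent_pool = [initial_params]              # pool starts with the SL-initialised policy
    current = initial_params
    for it in range(1, num_iterations + 1):
        opponent = random.choice(opponent_pool)   # random previous iteration
        current = train_step(current, opponent)   # play games and update the policy
        if it % 500 == 0:                         # every 500 iterations,
            opponent_pool.append(current)         # add the current parameters to the pool
    return current
```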
Network Architecture
$v^{p}(s) = \mathbb{E}\big[z_t \mid s_t = s,\ a_{t \ldots T} \sim p\big]$
MCTS simulates promising action sequences to estimate the best move without traversing every node of the search tree.
Each simulation descends from the root node to a leaf node (an unvisited, non-terminal node).
$a_t = \arg\max_a \big( Q(s_t, a) + u(s_t, a) \big)$
$u(s, a) \propto \dfrac{P(s, a)}{1 + N(s, a)}$
MCTS selection
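As an illustration, a small NumPy sketch of this selection rule; the $\sqrt{\sum_b N(s,b)}$ factor and the `c_puct` constant follow the usual PUCT form and are assumptions not spelled out on the slide:

```python
import numpy as np

def select_action(Q, N, P, c_puct=5.0):
    """Pick argmax_a Q(s,a) + u(s,a), with u(s,a) proportional to
    P(s,a) / (1 + N(s,a)); Q, N, P are arrays indexed by action."""
    u = c_puct * P * np.sqrt(N.sum() + 1e-8) / (1.0 + N)
    return int(np.argmax(Q + u))
```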
When a leaf node is reached, add a new node to the search tree.
Store the prior probability of the node.
MCTS expansion
The leaf node is evaluated in two ways:
by the value network $v_\theta(s_L)$;
by the outcome $z_L = \pm 1$ of a random rollout to the terminal step using the fast rollout policy $p_\pi$.
$V(s_L) = (1 - \lambda)\, v_\theta(s_L) + \lambda z_L$
MCTS evaluation
At the end of simulation, the action values and visit counts of all traversed edges are updated.
$N(s, a) = \sum_{i=1}^{n} \mathbf{1}(s, a, i)$
$Q(s, a) = \dfrac{1}{N(s, a)} \sum_{i=1}^{n} \mathbf{1}(s, a, i)\, V(s_L^{i})$
MCTS backup
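A minimal sketch of this backup step, assuming the tree edges are kept in a dict of per-edge statistics (the accumulated-value field `W` is an implementation detail not shown on the slide):

```python
def backup(path, leaf_value, edges):
    """path: list of (state, action) keys traversed in this simulation;
    edges: dict mapping (state, action) -> {'N': int, 'W': float, 'Q': float}."""
    for key in path:
        edge = edges[key]
        edge['N'] += 1                        # visit count N(s, a)
        edge['W'] += leaf_value               # accumulated leaf evaluations V(s_L)
        edge['Q'] = edge['W'] / edge['N']     # mean action value Q(s, a)
```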
$\Delta\sigma \propto \dfrac{\partial \log p_\sigma(a \mid s)}{\partial \sigma}$
$\Delta\rho \propto \dfrac{\partial \log p_\rho(a \mid s)}{\partial \rho}\, z_t$
$\Delta\theta \propto \dfrac{\partial v_\theta(s)}{\partial \theta}\,\big(z - v_\theta(s)\big)$
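For concreteness, a toy PyTorch sketch of these three updates using random tensors in place of the real networks; the shapes, batch size and combined backward pass are illustrative assumptions, not AlphaGo's implementation:

```python
import torch
import torch.nn.functional as F

# Toy batch: 4 positions, 361 possible moves (random stand-ins for network outputs).
sl_logits = torch.randn(4, 361, requires_grad=True)   # SL policy p_sigma(a|s)
rl_logits = torch.randn(4, 361, requires_grad=True)   # RL policy p_rho(a|s)
v_pred    = torch.randn(4, requires_grad=True)        # value network v_theta(s)
expert_actions = torch.randint(0, 361, (4,))          # human expert moves
actions = torch.randint(0, 361, (4,))                 # self-play moves
z = torch.tensor([1.0, -1.0, 1.0, -1.0])              # game outcomes z_t

# SL policy: ascend log p_sigma(a|s) on expert moves (cross-entropy descent).
sl_loss = F.cross_entropy(sl_logits, expert_actions)

# RL policy: REINFORCE, log-probability gradient weighted by the outcome z_t.
logp = F.log_softmax(rl_logits, dim=-1).gather(1, actions[:, None]).squeeze(1)
rl_loss = -(logp * z).mean()

# Value network: regress v_theta(s) towards the game outcome z.
value_loss = F.mse_loss(v_pred, z)

(sl_loss + rl_loss + value_loss).backward()   # gradients match the three update rules
```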
AlphaGo defeated Fan Hui, the winner of the 2013, 2014 and 2015 European Go Championships, 5:0 in a formal five-game match held 5-9 October 2015.
AlphaGo Zero is trained solely by self-play, without any supervision or use of human data.
AlphaGo Zero only uses the black and white stones from the Go board as its input.
AlphaGo Zero uses a single neural network, rather than separate policy and value networks.
AlphaGo Zero does not use “rollouts” to evaluate positions and sample moves. Instead, it uses a simpler tree search that relies upon this single neural network.
A game terminates when both players pass, when the search value drops below a resignation threshold or when the game exceeds a maximum length.
After each iteration of training, the performance of the new player was measured against the current best player; if the new player won by a margin of at least 55%, it replaced the best player.
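A sketch of that gating step; the `evaluate` callable and the number of evaluation games are placeholders:

```python
def maybe_promote(best_player, candidate, evaluate, win_threshold=0.55):
    """Replace the best player only if the candidate wins at least 55% of
    the evaluation games; `evaluate` returns the candidate's win rate."""
    win_rate = evaluate(candidate, best_player)
    return candidate if win_rate >= win_threshold else best_player
```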
$z_t = \pm 1$
$(\mathbf{p}, v) = f_\theta(s)$
$l = (z - v)^2 - \boldsymbol{\pi}^{\mathrm{T}} \log \mathbf{p} + c \lVert \theta \rVert^2$
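A single-example NumPy sketch of this loss; the L2 term is passed in as a precomputed scalar and the value of `c` is illustrative:

```python
import numpy as np

def alphago_zero_loss(z, v, pi, p, theta_sq_norm, c=1e-4):
    """l = (z - v)^2 - pi^T log(p) + c * ||theta||^2 for one position;
    pi is the MCTS visit-count policy, p the network policy output."""
    value_loss = (z - v) ** 2
    policy_loss = -np.sum(pi * np.log(p + 1e-12))   # cross-entropy term
    return value_loss + policy_loss + c * theta_sq_norm
```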
Self-training pipeline
Basically the same as AlphaGo, with slight differences in the evaluation and play phases.
$(P(s, \cdot), V(s)) = f_\theta(s)$
$\pi(a \mid s_0) = \dfrac{N(s_0, a)^{1/\tau}}{\sum_b N(s_0, b)^{1/\tau}}$
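A small NumPy sketch of turning root visit counts into the play policy; the greedy branch for $\tau \to 0$ is an assumption about how the limit is handled:

```python
import numpy as np

def search_policy(visit_counts, tau=1.0):
    """pi(a|s0) = N(s0,a)^(1/tau) / sum_b N(s0,b)^(1/tau)."""
    counts = np.asarray(visit_counts, dtype=np.float64)
    if tau == 0:                       # greedy limit: always play the most-visited move
        pi = np.zeros_like(counts)
        pi[np.argmax(counts)] = 1.0
        return pi
    scaled = counts ** (1.0 / tau)
    return scaled / scaled.sum()
```

For example, `search_policy([10, 40, 0, 50], tau=1.0)` returns probabilities proportional to the visit counts, while a small `tau` concentrates the policy on the most-visited move.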
1 convolutional block followed by 19 or 39 residual blocks.
Two output heads for computing the policy and value.
Feature size: 19×19×17 (8 history steps × 2 players' stone planes + 1 colour plane); no hand-crafted features as in AlphaGo.
$s_t = [X_t, Y_t, X_{t-1}, Y_{t-1}, \ldots, X_{t-7}, Y_{t-7}, C]$
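A NumPy sketch of assembling these planes, assuming boards are stored as 19x19 arrays with 0 = empty, 1 = black, 2 = white and at least eight history boards are available; the encoding details are assumptions, not taken from the paper's code:

```python
import numpy as np

def encode_state(history, to_play):
    """Stack [X_t, Y_t, ..., X_{t-7}, Y_{t-7}, C] into a 19x19x17 tensor.
    history: chronological list of boards; to_play: 1 (black) or 2 (white)."""
    opponent = 2 if to_play == 1 else 1
    planes = []
    for board in reversed(history[-8:]):                        # newest board first
        planes.append((board == to_play).astype(np.float32))    # X: own stones
        planes.append((board == opponent).astype(np.float32))   # Y: opponent stones
    colour = 1.0 if to_play == 1 else 0.0
    planes.append(np.full((19, 19), colour, dtype=np.float32))  # C: constant colour plane
    return np.stack(planes, axis=-1)                            # shape (19, 19, 17)
```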
AlphaGo Zero used a single machine with 4 tensor processing units (TPUs), whereas AlphaGo Lee was distributed over many machines and used 48 TPUs. AlphaGo Zero defeated AlphaGo Lee by 100 games to 0.
a: Five human joseki (common corner sequences) discovered during AlphaGo Zero training.
b: Five joseki favoured at different stages of self-play training.
c: The first 80 moves of three self-play games played at different stages of training.
A more generic version of AlphaGo Zero that accommodates, without special casing, a broader class of game rules.
Uses the same algorithm, network architecture and hyperparameters for chess and shogi as well as Go.
Go games have a binary win/loss outcome, whereas both chess and shogi may also end in a draw. The game outcome is $z \in \{-1, 0, +1\}$: $-1$ for a loss, $0$ for a draw, $+1$ for a win.
To accommodate this broader class of games, AlphaZero does not assume board symmetry: it neither augments the training data nor transforms the board position during MCTS.
AlphaZero always generates self-play games with the latest parameters from the most recent iteration, instead of with the best player over all previous iterations as in AlphaGo Zero.
Feature size: $N \times N \times (MT + L)$:
$N \times N$ is the board size.
$T$ is the number of history board positions, covering time steps $[t-T+1, \ldots, t]$.
$M$ is the number of planes per time step encoding the presence of each player's pieces.
$L$ is the number of constant planes encoding the player's colour, the move count and other special-rule state.
Illegal moves are masked out by setting their probabilities to zero, and re-normalising the probabilities over the remaining set of legal moves.
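A NumPy sketch of this masking step; the uniform fallback for an all-zero masked policy is an assumption:

```python
import numpy as np

def mask_illegal_moves(policy, legal_mask):
    """Zero out illegal moves and renormalise over the legal ones.
    policy: raw network probabilities; legal_mask: boolean array of legal moves."""
    masked = np.where(legal_mask, policy, 0.0)
    total = masked.sum()
    if total > 0.0:
        return masked / total
    return legal_mask.astype(np.float64) / legal_mask.sum()  # degenerate fallback
```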
The self-play training process and the neural network architecture (1 convolutional block + 19 residual blocks) are basically the same as in AlphaGo Zero.
$z_t \in \{-1, 0, +1\}$
$(\mathbf{p}, v) = f_\theta(s)$
$l = (z - v)^2 - \boldsymbol{\pi}^{\mathrm{T}} \log \mathbf{p} + c \lVert \theta \rVert^2$
Self-play training process in AlphaGo Zero.
AlphaZero used a single machine with 4 TPUs (same as AlphaGo Zero). Milestones:
MuZero extends AlphaZero to a broader set of environments (e.g. Atari 2600), including single-agent domains and non-zero rewards at intermediate time steps.
No rules are given in advance; MuZero learns a model of the environment with its internal networks.
Overall training process of MuZero.
Aside from $(Q, N, P)$ as before, the edges in MuZero's MCTS also store an intermediate reward $R(s, a)$ and a state transition $S(s, a)$.
The evaluation value is the $k$-step cumulative discounted reward, bootstrapped from the value estimate:
$V(s_t) = \sum_{\tau=0}^{k-1} \gamma^{\tau} r_{t+\tau} + \gamma^{k} v_{t+k}$
$Q(s, a) = \dfrac{1}{N(s, a)} \sum_{i=1}^{n} \mathbf{1}(s, a, i)\, V(s_L^{i})$
$\bar{Q}(s, a) = \dfrac{Q(s, a) - \min Q}{\max Q - \min Q}$
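A minimal sketch of tracking the tree-wide bounds and normalising Q values with them; the running min/max bookkeeping is an implementation assumption:

```python
class MinMaxStats:
    """Track the minimum and maximum Q value observed in the search tree."""

    def __init__(self):
        self.minimum = float("inf")
        self.maximum = float("-inf")

    def update(self, q):
        self.minimum = min(self.minimum, q)
        self.maximum = max(self.maximum, q)

    def normalize(self, q):
        # Q_bar(s,a) = (Q(s,a) - min Q) / (max Q - min Q), when the bounds differ.
        if self.maximum > self.minimum:
            return (q - self.minimum) / (self.maximum - self.minimum)
        return q
```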
The self-play process is basically the same, except that intermediate rewards $u_t$ are also considered.
Self-play trajectories are stored in a replay buffer (off-policy RL).
The value target $z_t$ is computed as the $n$-step cumulative discounted reward:
$z_t = u_{t+1} + \gamma u_{t+2} + \cdots + \gamma^{n-1} u_{t+n} + \gamma^{n} v_{t+n}$
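A small sketch of computing this bootstrapped target, assuming `rewards[k]` holds $u_k$ and `values[k]` holds $v_k$ indexed by absolute time step; the indexing convention and discount value are assumptions:

```python
def n_step_return(rewards, values, t, n, gamma=0.997):
    """z_t = u_{t+1} + gamma*u_{t+2} + ... + gamma^(n-1)*u_{t+n} + gamma^n * v_{t+n}."""
    z = 0.0
    for k in range(1, n + 1):
        z += (gamma ** (k - 1)) * rewards[t + k]   # discounted intermediate rewards
    z += (gamma ** n) * values[t + n]              # bootstrap with the search value
    return z
```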
A trajectory is sampled from the replay buffer.
$l = (z - v)^2 - \boldsymbol{\pi}^{\mathrm{T}} \log \mathbf{p} + (u - r)^2 + c \lVert \theta \rVert^2$
The network architecture is basically the same as in AlphaZero, but with 16 instead of 20 residual blocks in representation and dynamics networks.
MuZero Reanalyze revisits its past time steps and re-executes MCTS search using the latest model parameters, potentially resulting in a better-quality policy than the original search.
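A sketch of the Reanalyze loop; the replay-buffer and search interfaces here are hypothetical placeholders:

```python
def reanalyze(replay_buffer, latest_network, run_mcts, num_positions=100):
    """Refresh stored policy/value targets by re-running MCTS with the newest
    parameters on positions sampled from past games."""
    for _ in range(num_positions):
        game, t = replay_buffer.sample_position()                  # pick a stored time step
        policy, value = run_mcts(latest_network, game.observation(t))
        game.update_targets(t, policy, value)                      # overwrite the old targets
```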
Normal training loop.
Reanalyze training loop.