AlphaZero: Which nodes are visited during self-play?

Question

Reading this article helped me a lot in understanding the principles behind AlphaZero. However, there are a few things I am not entirely sure about.

Below is the author's UCT_search method; his code is available on GitHub: https://github.com/plkmo/AlphaZero_Connect4/tree/master/src
Here, UCTNode.backup() adds the network's value_estimate to all traversed nodes (see also this "cheat sheet").

# Assumes torch, the repository's board encoder `ed`, and its
# UCTNode / DummyNode classes are already in scope.
def UCT_search(game_state, num_reads, net, temp):
    root = UCTNode(game_state, move=None, parent=DummyNode())
    for i in range(num_reads):
        # Pick the leaf to evaluate in this simulation
        leaf = root.select_leaf()
        # Encode the leaf position and move the last axis to the front
        encoded_s = ed.encode_board(leaf.game)
        encoded_s = encoded_s.transpose(2, 0, 1)
        encoded_s = torch.from_numpy(encoded_s).float().cuda()
        child_priors, value_estimate = net(encoded_s)
        child_priors = child_priors.detach().cpu().numpy().reshape(-1)
        value_estimate = value_estimate.item()
        # If somebody won or the game is a draw, back up without expanding
        if leaf.game.check_winner() == True or leaf.game.actions() == []:
            leaf.backup(value_estimate)
            continue
        leaf.expand(child_priors)  # need to make sure valid moves
        leaf.backup(value_estimate)
    return root

This method seems to visit only the nodes directly connected to the root node.
However, DeepMind's original paper (on AlphaGo Zero) says:

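Each simulation starts from the root state and iteratively selects moves that maximize an upper confidence bound Q(s, a) + U(s, a), where U(s, a) ∝ P(s, a)/(1 + N(s, a)), until a leaf node s' is encountered.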

So instead, I would expect something like this:

def UCT_search():
    for i in range(num_reads):
        current_node = root
        while current_node.is_expanded:
            …
            current_node = current_node.select_leaf()
        current_node.backup(value_estimate)

(UCTNode.is_expanded is False if the node has not been visited yet, or if it is a terminal state, i.e. the game is over.)

Can you explain why this is the case? Or am I overlooking something?
Thanks in advance.

Answer 1

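The logic you describe is inside the select_leaf() method: it selects the best leaf, which is not necessarily a node directly connected to the root.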
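
To illustrate, here is a minimal, self-contained sketch of how a select_leaf()-style method can walk down the tree on its own. It is not the author's code: the Node class, the uniform priors, the constant value of 0.0 in place of the network, the c_puct constant, and the square-root factor on the parent's visit count are all assumptions made for this toy example, and sign handling for alternating players is left out.

```python
import math


class Node:
    """Toy stand-in for the repository's UCTNode (hypothetical, simplified)."""

    def __init__(self, parent=None, prior=1.0, depth=0):
        self.parent = parent
        self.prior = prior        # P(s, a), assigned by the parent at expansion
        self.depth = depth        # distance from the root, for inspection only
        self.children = []
        self.is_expanded = False
        self.visits = 0           # N(s, a)
        self.total_value = 0.0    # sum of backed-up values

    def score(self, child, c_puct=1.0):
        # Q(s, a) + U(s, a), with U proportional to P(s, a) / (1 + N(s, a))
        q = child.total_value / child.visits if child.visits else 0.0
        u = c_puct * child.prior * math.sqrt(self.visits) / (1 + child.visits)
        return q + u

    def select_leaf(self):
        # Walk down from this node until an unexpanded node is reached
        current = self
        while current.is_expanded:
            current = max(current.children, key=current.score)
        return current

    def expand(self, priors):
        self.is_expanded = True
        self.children = [Node(self, p, self.depth + 1) for p in priors]

    def backup(self, value):
        # Add the value to every node on the path back to the root
        node = self
        while node is not None:
            node.visits += 1
            node.total_value += value
            node = node.parent


def search(num_reads, branching=3, max_depth=4):
    root = Node()
    leaf_depths = []
    for _ in range(num_reads):
        leaf = root.select_leaf()
        leaf_depths.append(leaf.depth)
        value = 0.0  # placeholder for the network's value_estimate
        if leaf.depth < max_depth:  # treat max_depth as a terminal state
            leaf.expand([1.0 / branching] * branching)
        leaf.backup(value)
    return root, leaf_depths


root, leaf_depths = search(50)
print(leaf_depths[:5])   # [0, 1, 1, 1, 2]
print(max(leaf_depths))  # deepest leaf reached in 50 simulations
```

In this sketch the first simulation expands the root, the next three expand its children, and the fifth already reaches depth 2, because the while loop in select_leaf() keeps descending through every expanded node. The outer loop in the search function therefore does not need its own descent loop.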

0 2020-05-27 16:44:17
