
How should I code the Gambler's Problem with Q-learning (without any reinforcement learning packages)?

I would like to solve the Gambler's problem as an MDP (Markov Decision Process).

Gambler's problem: A gambler has the opportunity to make bets on the outcomes of a sequence of coin flips. If the coin comes up heads, he wins as many dollars as he has staked on that flip; if it is tails, he loses his stake. The game ends when the gambler wins by reaching his goal of κ dollars, or loses by running out of money. On each flip, the gambler must decide how many (integer) dollars to stake. The probability of heads is p and that of tails is 1 − p.

I implemented the model-free Q-learning method using a totally random base policy. But the code is not working as I hoped and I can't figure out why. Thank you for any suggestions. :)

import numpy as np
import matplotlib.pyplot as plt
import random

#data
kappa=100 #goal
p=0.25  #probability of heads (winning the flip)
eps=0.1 #0.1, 0.005 epsilon
gamma=0.9 #discount factor
alpha=0.1 # 0.1, 1, 10 learning rate
n=1000 #number of training episodes

#Q-learning with totally random base policy
S = [*range(0,kappa+1)] 
A = [*range(0,kappa+1)]

R=np.zeros((kappa+1,kappa+1))
for i in A:
    R[kappa,i]=1

Q=np.zeros((kappa+1,kappa+1))
optimal_policy=np.zeros(kappa+1)

for sa in range(1,kappa):
    i=0
    while i<n:
        s=sa
        while True:
            #choose a random action
            seged=min(s,kappa-s) #stake is capped: at most what I own, and no more than needed to reach kappa
            a=np.random.randint(low=1,high=seged+1)
            #take action, observe the state
            rand=random.uniform(0,1)
            if rand < p: #if I win, I get more coins
                s_next = s + a
            else: #if I lose, I lose the stake
                s_next = s - a
                
            Q[s,a]=Q[s,a]+alpha*(R[s_next,a]+(gamma*max(Q[s_next,b] for b in range(0,s_next+1))-Q[s,a]))
            
            if s_next==0:
                break
            if s_next==kappa:
                i=i+1
                break 
            s = s_next
            
for s in range(1,kappa+1):
    optimal_policy[s]=np.argmax(Q[s,])
Q=np.round(Q,2)
print(Q)
print(optimal_policy)

x = np.array(range(0, kappa+1))
y = optimal_policy
plt.xlabel("Amount available (Current State)")
plt.ylabel('Recommended betting amount')
plt.title("Optimal policy: Random base policy (p=" + str(p)+", \u03B1=" + str(alpha)+")")
plt.scatter(x, y)
plt.show()

The problem seems to be that your while i<n loop never terminates.

It looks like you accidentally wait until the first win before incrementing i. (You forgot to increment i when the episode ends with a loss.) To avoid this mistake, I suggest writing that loop as for i in range(n) instead of incrementing i before each break.

This first win never happens, because when you start with 1 dollar and the win probability is only 25%, it is (in practice) impossible to win this game. This also means that your first few iterations (starting with little money) will not learn anything, because they never win: R[] is always zero, and there is no signal in the Q[] table yet to propagate between states.
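
To get a sense of how unlikely that first win is, consider the simplest case where the gambler always stakes exactly 1 dollar (the random-stake policy in your code is different, but the order of magnitude is similar). The classical gambler's ruin formula then gives the probability of reaching kappa before going broke:

# Gambler's ruin with unit bets (a rough illustration, not your exact policy):
# probability of reaching kappa before ruin, starting from s dollars,
# with per-flip win probability p.
p, q, kappa, s = 0.25, 0.75, 100, 1
ratio = q / p
win_prob = (1 - ratio**s) / (1 - ratio**kappa)
print(win_prob)  # ~3.9e-48 -- effectively zero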

What I did to figure this out was simply to insert some statements like print('i:', i) into the code.
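
Putting that suggestion into code, here is a minimal sketch of how the training loop could be restructured, reusing the data and tables already defined in your script:

for sa in range(1,kappa):
    for i in range(n): #every episode counts, win or lose
        s=sa
        while True:
            #choose a random action, capped at min(s,kappa-s)
            a=np.random.randint(low=1,high=min(s,kappa-s)+1)
            #take the action, observe the next state
            if random.uniform(0,1) < p:
                s_next = s + a
            else:
                s_next = s - a
            #Q-learning update
            Q[s,a]=Q[s,a]+alpha*(R[s_next,a]+gamma*max(Q[s_next,b] for b in range(0,s_next+1))-Q[s,a])
            if s_next==0 or s_next==kappa: #terminal either way
                break
            s = s_next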
