
Could someone please explain what this NumPy code does?

What do the following lines, taken from the code further down, mean?

 avg= np.mean(a[np.where(a[:,0]== u[0])][:,1])

bestArm = u[0]

 choice = np.where(arms == np.random.choice(arms))[0][0]

runningMean = np.mean(av[:,1])

These lines come from the reinforcement learning program shown below:

import numpy as np
import random
import matplotlib.pyplot as plt
%matplotlib inline

np.random.seed(5)

n= 10
arms= np.random.rand(n)
eps= 0.1 #probability of exploration action


def reward(prob):
    reward = 0
    for i in range(10):
        if random.random() < prob:
            reward += 1
    return reward

#initialize memory array; has 1 row defaulted to a random action index
#randint's upper bound is exclusive, so n keeps the index within 0..n-1
av = np.array([np.random.randint(0, n), 0]).reshape(1,2) #av = action-value
#greedy method to select best arm based on memory array
def bestArm(a):
    bestArm = 0 #default to 0
    bestMean= 0
    for u in a:
        avg= np.mean(a[np.where(a[:,0]== u[0])][:,1]) # calculate mean reward for each action
        if bestMean < avg:
            bestMean = avg
            bestArm = u[0]
    return bestArm        
        
        
     
        
plt.xlabel('Number of times played')
plt.ylabel('Average Reward')
for i in range(500):
    if random.random() > eps: #greedy exploitation action
        choice= bestArm(av)
        thisAV= np.array([[choice, reward(arms[choice])]])
        av= np.concatenate((av, thisAV), axis= 0)
    else:
        choice = np.where(arms == np.random.choice(arms))[0][0]
        thisAV= np.array([[choice, reward(arms[choice])]]) #choice, reward
        av= np.concatenate((av, thisAV), axis = 0) # add to our action value memory array
        
    # calculate the mean reward after every play
    runningMean = np.mean(av[:,1])
    plt.scatter(i, runningMean)

I would appreciate help with this. I have tried googling these lines of code, but was not fully satisfied with the answers I found. Thanks.

avg= np.mean(a[np.where(a[:,0]== u[0])][:,1])

Reading from the inside out: a[:,0] == u[0] produces an array of True and False values, True wherever the first column of a equals u[0]. np.where returns the indices at which that array is True. Indexing with a[np.where(...)] keeps only those rows of a, and [:,1] takes the second column of that subarray. So you are averaging the second-column values of the rows of a whose first column equals u[0].
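The steps above can be traced on a small array. This is only a sketch: the values in a are made up for illustration, and u is simply taken to be the first row of a, as it would be on the first pass of the loop in bestArm.

```python
import numpy as np

# Hypothetical memory array: column 0 = action index, column 1 = reward
a = np.array([[2, 7],
              [5, 3],
              [2, 9],
              [5, 4]])
u = a[0]                       # one row of a, here [2, 7]

mask = a[:, 0] == u[0]         # True where column 0 equals 2
rows = a[np.where(mask)]       # keeps rows 0 and 2: [[2, 7], [2, 9]]
avg = np.mean(rows[:, 1])      # mean of [7, 9], i.e. 8.0

# Indexing with the boolean mask directly gives the same number
avg_short = a[mask, 1].mean()
```

The np.where call is not strictly needed here, since NumPy accepts the boolean array itself as an index.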

choice = np.where(arms == np.random.choice(arms))[0][0]

Reading from the inside out: np.random.choice(arms) picks an element of the arms array at random. arms == np.random.choice(arms) is True at the positions of arms that match the random choice and False elsewhere. np.where, again, returns the indices that were True, as a tuple containing one index array; the first [0] takes that array and the second [0] takes its first entry. So choice is the index in arms of an element chosen at random.
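Splitting that line into its parts may make the [0][0] easier to see. The intermediate names picked, matches and choice_direct below are introduced only for illustration; the last line is a suggested shortcut, not something from the original program, and it has the same effect only if the values in arms are all distinct.

```python
import numpy as np

np.random.seed(5)
arms = np.random.rand(10)              # 10 probabilities, as in the question

picked = np.random.choice(arms)        # a random *value* taken from arms
matches = np.where(arms == picked)     # a tuple holding one array of indices
choice = matches[0][0]                 # first matching index in that array

# Shortcut (assumes distinct values in arms): draw the index directly
choice_direct = np.random.randint(len(arms))
```

Either way the result is an integer position in arms, which is what arms[choice] in the loop expects.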

runningMean = np.mean(av[:,1])

This takes the average of the second column of av, that is, the mean of all rewards recorded so far.
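A short sketch of how that mean changes as rows are appended, using made-up values for av and a hypothetical new row this_av:

```python
import numpy as np

# Hypothetical memory array: column 0 = action index, column 1 = reward
av = np.array([[3, 0],
               [3, 6],
               [7, 9]])
running_mean = np.mean(av[:, 1])           # mean of [0, 6, 9], i.e. 5.0

# Append one more (action, reward) row, as the loop does with np.concatenate
this_av = np.array([[7, 1]])
av = np.concatenate((av, this_av), axis=0)
new_mean = np.mean(av[:, 1])               # mean of [0, 6, 9, 1], i.e. 4.0
```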
