What do the following lines of code, taken from the program further down, mean?
avg= np.mean(a[np.where(a[:,0]== u[0])][:,1])
bestArm = u[0]
choice = np.where(arms == np.random.choice(arms))[0][0]
runningMean = np.mean(av[:,1])
These lines come from the reinforcement learning program shown below:
import numpy as np
import random
import matplotlib.pyplot as plt
%matplotlib inline
np.random.seed(5)
n= 10
arms= np.random.rand(n)
eps= 0.1 #probability of exploration action
def reward(prob):
    reward = 0
    for i in range(10):
        if random.random() < prob:
            reward += 1
    return reward
#initialize memory array; has 1 row defaulted to random action index
av = np.array([np.random.randint(0, (n+ 1)), 0]).reshape(1,2) #av = action-value
#greedy method to select best arm based on memory array
def bestArm(a):
    bestArm = 0 #default to 0
    bestMean= 0
    for u in a:
        avg= np.mean(a[np.where(a[:,0]== u[0])][:,1]) # calculate mean reward for each action
        if bestMean < avg:
            bestMean = avg
            bestArm = u[0]
    return bestArm
plt.xlabel('Number of times played')
plt.ylabel('Average Reward')
for i in range(500):
    if random.random() > eps: #greedy exploitation action
        choice= bestArm(av)
        thisAV= np.array([[choice, reward(arms[choice])]])
        av= np.concatenate((av, thisAV), axis= 0)
    else:
        choice = np.where(arms == np.random.choice(arms))[0][0]
        thisAV= np.array([[choice, reward(arms[choice])]]) #choice, reward
        av= np.concatenate((av, thisAV), axis = 0) # add to our action value memory array
    # calculate the mean reward
    runningMean = np.mean(av[:,1])
    plt.scatter(i, runningMean)
I would appreciate help with this. I have tried googling these lines of code but was not fully satisfied with the answers I found. Thanks.
avg= np.mean(a[np.where(a[:,0]== u[0])][:,1])
From the inside out, a[:,0]==u[0] produces an array of True and False values marking where the first column of a equals the value in u[0]. np.where returns the indices where that array contains True. Taking a[np.where...] returns only those rows of a, and the [:,1] returns the second column of that subarray. So you're taking the average of the second-column values over the rows of a whose first column equals u[0].
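To make the steps concrete, here is a small sketch on a made-up memory array (the values in a are hypothetical, not taken from the question). Inside bestArm, u is one row of a, so u[0] is the arm index stored in that row, which is also what bestArm = u[0] records when that arm's average is the best seen so far.

```python
import numpy as np

# Hypothetical memory array: each row is [arm index, reward].
a = np.array([[3, 7],
              [5, 2],
              [3, 9],
              [5, 4]])
u = a[0]                    # one row, as in `for u in a`; u[0] is its arm index (3)

mask = a[:, 0] == u[0]      # True where the first column equals 3
rows = a[np.where(mask)]    # only the rows for arm 3
rewards = rows[:, 1]        # second column of those rows: [7, 9]
avg = np.mean(rewards)      # 8.0

# Boolean indexing gives the same result without np.where.
avg_direct = a[mask, 1].mean()
```

The np.where call is not strictly needed here, since NumPy accepts the boolean array as an index directly.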
choice = np.where(arms == np.random.choice(arms))[0][0]
From the inside out, np.random.choice(arms) picks an element at random from the arms array. arms == np.random.choice(arms) returns True for those entries of arms that match the random choice, and False otherwise. The np.where, again, returns the indices of the entries that were True, wrapped in a tuple with one index array per dimension. The first [0] takes that index array out of the tuple and the second [0] takes its first entry. So choice ends up as the index of an arm chosen at random.
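The same line can be unpacked step by step. The sketch below reuses the seed and array size from the question; the intermediate names picked, matches and choice_direct are introduced here only for illustration. The last line assumes that all the code needs is a uniformly random arm index, in which case drawing the index directly should behave the same way.

```python
import numpy as np

np.random.seed(5)                    # same seed as the question, for repeatability
arms = np.random.rand(10)            # ten random success probabilities

picked = np.random.choice(arms)      # a random value taken from arms
matches = np.where(arms == picked)   # tuple holding one array of matching indices
choice = matches[0][0]               # first [0]: the index array; second [0]: its first entry

# Assumption: only a uniformly random arm index is needed,
# so it can be drawn directly instead of searching for the value.
choice_direct = np.random.randint(len(arms))
```

Because the entries of arms are random floats, duplicates are very unlikely, so the value picked normally maps back to exactly one index.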
runningMean = np.mean(av[:,1])
This takes the average of the second column of av.
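Since the second column of av holds the reward of every play so far (including the initial placeholder row with reward 0 in the question's code), this is the average reward per play up to the current iteration. A minimal sketch with made-up values follows; the cumulative-sum variant is an alternative shown for comparison, not something the original program does.

```python
import numpy as np

# Hypothetical memory array after four plays: rows are [arm index, reward].
av = np.array([[2, 6],
               [2, 8],
               [7, 1],
               [2, 9]])

running_mean = np.mean(av[:, 1])    # (6 + 8 + 1 + 9) / 4 = 6.0

# The running mean after every play, computed in one pass.
steps = np.arange(1, len(av) + 1)
running_means = np.cumsum(av[:, 1]) / steps   # [6.0, 7.0, 5.0, 6.0]
```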