Modeling a reinforcement learning environment with Ray
I have been trying to apply reinforcement learning to a particular problem, where I am optimizing the raw-material purchasing strategy for specific commodities. I created a simple Gym environment to demonstrate a simplified version of what I want to accomplish. The goal is to take in multiple items (2 in this example) and optimize the purchasing strategy for each one so that the total days-on-hand across all items is minimized, without running out of any single item.
from gym import Env
from gym.spaces import Discrete, Box, Tuple
from gym import spaces
import numpy as np
import random
import pandas as pd
from random import randint
#define our variable starting points
#array of the start quantity for 2 separate items
start_qty = np.array([10000, 200])
#create the number of simulation weeks
sim_weeks = 1
#set a starting safety stock level------IGNORE FOR NOW
#safety_stock = 4003
#create simple demand profile for each item
#demand = np.array([301, 1549, 3315, 0, 1549, 0, 0, 1549, 1549, 1549])
demand = np.array([1800, 45])
#create minimum order and max order quantities for each item
min_ord = np.array([26400, 250])
max_ord = np.array([100000, 100000])
prev_30_usage = np.array([1600, 28])
#how this works: in the numpy arrays, index 0 holds the first item's info
#and index 1 holds the second item's info
class ResinEnv(Env):
    def __init__(self):
        self.action_space = Tuple([Discrete(2), Discrete(2)])
        self.observation_space = Box(low=np.array([-10000000]), high=np.array([10000000]))
        #set the start qty
        self.state = np.array([10000, 200])
        #self.start = start_qty
        #set the purchase length
        self.purchase_length = sim_weeks
        self.min_order = min_ord

    def step(self, action):
        self.purchase_length -= 1
        #apply action
        self.state[0] -= demand[0]
        self.state[1] -= demand[1]
        #see if we need to buy
        #each action component is 0 or 1- round it to the nearest integer
        action = np.around(action, decimals=0)
        #self.state += action*self.min_order
        np.add(self.state, action*self.min_order, out=self.state, casting="unsafe")
        #self.state += (action*100) + 26400
        #calculate the days on hand from this
        days = self.state/prev_30_usage/7
        #item_reward1 = action[0]
        #item_reward2 = action[1]
        #calculate reward: right now reward is negative of days_on_hand
        #GOING TO NEED TO CHANGE THIS REWARD AT SOME POINT MOVING FORWARD AS IT
        #NEEDS TO TREAT HIGH VOLUME ITEMS AND LOW VOLUME ITEMS THE SAME- THIS IS BIASED AGAINST LOW VOLUME
        if self.state[0] < 0:
            item_reward1 = -10000
        else:
            item_reward1 = days[0]
        if self.state[1] < 0:
            item_reward2 = -10000
        else:
            item_reward2 = days[1]
        reward = item_reward1 + item_reward2
        #check if we are out of weeks
        if self.purchase_length <= 0:
            done = True
        else:
            done = False
        #reduce the weeks left to purchase by 1 week
        #done = True
        #set placeholder for info
        info = {}
        #return step information
        return self.state, reward, done, info

    def render(self):
        pass

    def reset(self):
        self.state = np.array([10000, 200])
        self.purchase_length = sim_weeks
        self.demand = demand
        self.action_space = Tuple([Discrete(2), Discrete(2)])
        self.min_order = min_ord
        return self.state #, self.purchase_length, self.demand, self.action_space, self.min_order
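To make the per-step bookkeeping concrete, here is a standalone sketch of a single step with the question's numbers (plain numpy, no gym needed; the values mirror the arrays defined above, and the hypothetical action buys item 0 but not item 1):

```python
import numpy as np

# mirror the question's setup for item 0 and item 1
state = np.array([10000, 200])        # starting on-hand quantity
demand = np.array([1800, 45])         # weekly demand
min_ord = np.array([26400, 250])      # minimum order quantities
prev_30_usage = np.array([1600, 28])  # recent usage, used for days-on-hand

action = np.array([1, 0])             # buy item 0, skip item 1

# one step: consume demand, then apply any purchases at the minimum order size
np.add(state, -demand, out=state, casting="unsafe")
np.add(state, action * min_ord, out=state, casting="unsafe")

# days-on-hand as computed in step(): qty / prev_30_usage / 7
days = state / prev_30_usage / 7

# -10000 penalty for any item that went negative, else its days-on-hand
rewards = np.where(state < 0, -10000.0, days)
reward = rewards.sum()
print(state, reward)  # item 0: 10000-1800+26400 = 34600; item 1: 200-45 = 155
```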
As the following code shows, the environment seems to run fine:
episodes = 100
for episode in range(1, episodes+1):
    state = env.reset()
    done = False
    score = 0
    while not done:
        #env.render()
        action = env.action_space.sample()
        n_state, reward, done, info = env.step(action)
        score += reward
    print('Episode:{} Score:{} Action:{}'.format(episode, score, action))
I have tried to run this through various modeling approaches with no luck, and then discovered Ray, but I can't seem to get that to work either. I am wondering if anyone could walk me through the process of modeling this in Ray, or help identify any issues with the environment itself that would keep Ray from working. Any help is greatly appreciated, as I am new to RL and completely stumped.
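For the Ray side, wiring a custom env into RLlib looks roughly like the sketch below. This is an untested outline, not a definitive recipe: the import paths follow older Ray 1.x releases (newer versions moved trainers under ray.rllib.algorithms), and it assumes ray[rllib] is installed and ResinEnv is importable. One thing worth noting before trying it: the env's observation_space is declared as a Box of shape (1,), but reset() and step() return a length-2 state, a mismatch that RLlib's space checks are likely to reject.

```python
import ray
from ray.tune.registry import register_env
from ray.rllib.agents.ppo import PPOTrainer  # Ray 1.x path; newer: ray.rllib.algorithms.ppo

def env_creator(env_config):
    # RLlib constructs the env itself, so register a factory rather than an instance
    return ResinEnv()

register_env("resin-env", env_creator)

ray.init()
trainer = PPOTrainer(config={"env": "resin-env", "framework": "torch"})
for i in range(10):
    result = trainer.train()
    print(i, result["episode_reward_mean"])
```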
I'm new to RL and was searching for some code when I found yours.
It seems you just need to define env; I added this line and it works:
....
episodes = 100
env = ResinEnv()
for episode in range(1, episodes+1):
    state = env.reset()
    done = False
    score = 0
....
Hope this helps.