
Modeling a reinforcement learning environment with Ray

I have been trying to apply reinforcement learning to a specific problem where I am optimizing the raw-material purchasing strategy for particular items. I created a simple gym environment to demonstrate a simplified version of what I want to accomplish. The goal is to take in several items (2 in this case) and optimize the purchasing strategy for each one so that the total days on hand across all items is minimized, without running out of any single item.

from gym import Env
from gym.spaces import Discrete, Box, Tuple
from gym import spaces
import numpy as np
import random
import pandas as pd
from random import randint

#define our variable starting points


#array of the start quantity for 2 separate items
start_qty = np.array([10000, 200])
#create the number of simulation weeks
sim_weeks = 1
#set a starting safety stock level------IGNORE FOR NOW
#safety_stock = 4003

#create simple demand profile for each item
#demand = np.array([301, 1549, 3315, 0, 1549, 0, 0, 1549, 1549, 1549])
demand = np.array([1800, 45])

#create minimum order and max order quantities for each item
min_ord = np.array([26400, 250])
max_ord = np.array([100000, 100000])
prev_30_usage = np.array([1600, 28])


#how this works is it in the numpy arrays- the stuff in index 0 is the first item's info
# and the stuff in index 1 is the second item's info
class ResinEnv(Env):
    def __init__(self):
        self.action_space = Tuple([Discrete(2), Discrete(2)])
        self.observation_space = Box(low= np.array([-10000000]), high = np.array([10000000]))
        #set the start qty
        self.state = np.array([10000, 200])
        #self.start = start_qty
        #set the purchase length
        self.purchase_length = sim_weeks
        self.min_order = min_ord
    def step(self, action):
        self.purchase_length -= 1
        #apply action 
        self.state[0] -= demand[0]
        self.state[1] -= demand[1]
        #see if we need to buy
        #each action component is 0 or 1- round each to the nearest integer
        action = np.around(action, decimals = 0)
        
        
        #self.state +=action*self.min_order
        
        np.add(self.state, action* self.min_order, out=self.state, casting="unsafe")
        #self.state += (action*100) + 26400
        #calculate the days on hand from this
        days = self.state/prev_30_usage/7
        
        
        #item_reward1 = action[0]
        #item_reward2 = action[1]
        #calculate reward: right now reward is negative of days_on_hand
        
        #GOING TO NEED TO CHANGE THIS REWARD AT SOME POINT MOVING FORWARD AS IT
        #NEEDS TO TREAT HIGH VOLUME ITEMS AND LOW VOLUME ITEMS THE SAME- THIS IS BIASED AGAINST LOW VOLUME
        if self.state[0] < 0:
            item_reward1 = -10000
        else:
            item_reward1 = days[0]
        if self.state[1]< 0:
            item_reward2 = -10000
        else:
            item_reward2 = days[1]
        
        reward = item_reward1 + item_reward2
        #check if we are out of weeks
        if self.purchase_length<=0:
            done = True
        else:
            done = False
        #reduce the weeks left to purchase by 1 week
        #done = True   
        #set placeholder for info
        info = {}
            
        #return step information
        return self.state, reward, done, info
    def render(self):
        pass
    def reset(self):
        self.state = np.array([10000, 200])
        self.purchase_length = sim_weeks
        self.demand = demand
        self.action_space = Tuple([Discrete(2), Discrete(2)])
        self.min_order= min_ord
        return self.state #, self.purchase_length, self.demand, self.action_space, self.min_order

As the code below shows, the environment seems to run fine:

episodes = 100
for episode in range(1, episodes+1):
    state = env.reset()
    done = False
    score = 0 
    
    while not done:
        #env.render()
        action = env.action_space.sample()
        n_state, reward, done, info = env.step(action)
        score+=reward
    print('Episode:{} Score:{} Action:{}'.format(episode, score, action))
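
The random-action loop above never checks whether the declared observation_space matches what reset() and step() actually return, which is something RL libraries such as Ray's RLlib typically validate. A quick, illustrative check (the check_env name is not part of the original code) could look like this:

#compare the declared observation space against the state the env returns
#note: observation_space is declared as a one-element Box, while reset()
#returns a two-element array, so contains() is expected to be False here
check_env = ResinEnv()
obs = check_env.reset()
print("declared shape:", check_env.observation_space.shape)
print("returned shape:", obs.shape)
print("state inside declared space:", check_env.observation_space.contains(obs))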

我试图通过各种建模方式来运行它,但没有运气,并且发现了 Ray,但似乎也无法让它发挥作用。 我想知道是否有人可以指导我完成在 Ray 中的建模过程,或者帮助确定环境本身的任何问题,这些问题会导致 Ray 无法工作。 非常感谢任何帮助,因为我是 RL 的新手并且完全被难住了。
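
For context on what "modeling this in Ray" usually involves: the environment is registered with RLlib and an algorithm is trained against it. Below is a minimal sketch of that wiring, assuming Ray 1.x with its classic gym-based ray.rllib.agents API; PPO is picked only as an example, and the resin_env name and config values are illustrative. RLlib's environment checks would also likely flag the mismatch between the declared one-element observation_space and the two-element state.

import ray
from ray.tune.registry import register_env
from ray.rllib.agents.ppo import PPOTrainer   #Ray 1.x style API (assumption)

#register the custom gym env under a name RLlib can look up
register_env("resin_env", lambda env_config: ResinEnv())

ray.init()

#PPO is used only as an example algorithm; config values are illustrative
trainer = PPOTrainer(config={
    "env": "resin_env",
    "num_workers": 1,
    "framework": "torch",
})

#run a few training iterations and print the mean episode reward
for i in range(5):
    result = trainer.train()
    print(i, result["episode_reward_mean"])

With newer Ray releases the same idea applies, but the entry points moved to ray.rllib.algorithms (e.g. PPOConfig), so the imports would differ.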

I'm new to RL and was searching for some code when I found yours.

It seems you just need to define env. I added this line and it worked:

....
episodes = 100
env = ResinEnv()
for episode in range(1, episodes+1):
    state = env.reset()
    done = False
    score = 0
....

Hope this helps.
