How do you use OpenAI Gym "wrappers" with a custom Gym environment in Ray Tune?

Question

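Suppose I have built a Python class called CustomEnv (similar to the CartPoleEnv class used to create the OpenAI Gym "CartPole-v1" environment) to create my own (custom) reinforcement learning environment, and I am using tune.run() from Ray Tune (Ray 2.1.0 with Python 3.9.15) to train an agent in my environment with the PPO algorithm: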

import ray
from ray import tune
from my_file import CustomEnv          # module containing the custom environment class

tune.run(
    "PPO",                              # 'PPO' algorithm
    config={
        "env": CustomEnv,               # custom class used to create an environment
        "framework": "tf2",
        "evaluation_interval": 100,
        "evaluation_duration": 100,
    },
    checkpoint_freq=100,                # Save a checkpoint at every evaluation
    local_dir=checkpoint_dir,           # Save results to a local directory (path defined elsewhere)
    stop={"episode_reward_mean": 250},  # Stopping criterion
)

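This works fine, and I can use TensorBoard to monitor training progress, etc., but learning turns out to be slow. So I would like to try Gym "wrappers" to scale observations, rewards, and/or actions, limit variance, and speed up learning. I have an ObservationWrapper, a RewardWrapper, and an ActionWrapper to do this, for example something like the following (the exact nature of the scaling is not central to my question):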

import gym

class ObservationWrapper(gym.ObservationWrapper):
    def __init__(self, env):
        super().__init__(env)
        self.o_min = 0.
        self.o_max = 5000.

    def observation(self, ob):
        # Normalize observations
        ob = (ob - self.o_min)/(self.o_max - self.o_min)
        return ob

class RewardWrapper(gym.RewardWrapper):
    def __init__(self, env):
        super().__init__(env)
        self.r_min = -500
        self.r_max = 100

    def reward(self, reward):
        # Scale rewards:
        reward = reward/(self.r_max - self.r_min)
        return reward

class ActionWrapper(gym.ActionWrapper):
    def __init__(self, env):
        super().__init__(env)

    def action(self, action):
        # Scale actions
        action = action/10
        return action
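
To make the effect of these three wrappers concrete, here is a small stand-alone sketch of the same arithmetic. It does not use Gym at all; the function names are made up for illustration, and the constants are simply copied from the wrappers above. Note that the reward rule divides by the range without shifting, so rewards keep their sign rather than landing in [0, 1].

```python
# Stand-alone copies of the three scaling rules, using the same constants
# as the wrappers above (hypothetical helper names, no Gym import needed).
O_MIN, O_MAX = 0.0, 5000.0
R_MIN, R_MAX = -500.0, 100.0


def scale_observation(ob):
    # Min-max normalization: maps [O_MIN, O_MAX] onto [0, 1]
    return (ob - O_MIN) / (O_MAX - O_MIN)


def scale_reward(reward):
    # Divides by the reward range only; there is no shift, so the sign is kept
    return reward / (R_MAX - R_MIN)


def scale_action(action):
    # The agent's action is divided by 10 before it reaches the environment
    return action / 10


print(scale_observation(2500.0))                  # midpoint of the observation range
print(scale_reward(R_MIN), scale_reward(R_MAX))   # scaled reward bounds
print(scale_action(5.0))
```

With these constants, scaled rewards fall roughly between -0.83 and 0.17, which is worth keeping in mind when choosing a stopping criterion such as episode_reward_mean.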

Wrappers like these work fine with my custom class when I create an instance of the class on my local machine and use it in a traditional training loop, like this:

from my_file import CustomEnv

env = CustomEnv()
wrapped_env = ObservationWrapper(RewardWrapper(ActionWrapper(env)))
episodes = 10

for episode in range(1,episodes+1):
    obs = wrapped_env.reset()
    done = False
    score = 0
    
    while not done:
        action = wrapped_env.action_space.sample()
        obs, reward, done, info = wrapped_env.step(action)
        score += reward

    print(f'Episode: {episode},  Score: {score:.3f}')

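My question is: how can I use these wrappers with my custom class (CustomEnv) and tune.run()? This method expects the value for "env" to be passed either (1) as a class (e.g., CustomEnv) or (2) as a string associated with a registered Gym environment (e.g., "CartPole-v1"), as I found out while trying various incorrect ways of passing a wrapped version of my custom class: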

ValueError: >>> is an invalid env specifier. You can specify a custom env as either a class (e.g., YourEnvCls) or a registered env id (e.g., "your_env").

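So I am not sure how to do it (assuming it is possible). I would prefer to solve this without having to register my custom Gym environment, but I am open to any solution.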

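In learning about wrappers, I mostly relied on "Getting Started with OpenAI Gym: The Basic Building Blocks" by Ayoosh Kathuria and "TF 2.0 for Reinforcement Learning: Gym Wrappers".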

Answer 1

I was able to answer my own question about how to get Ray's tune.run() to work with a wrapped custom class for a Gym environment. The documentation for Ray Environments was helpful.

The solution was to register the custom class through Ray. Assuming you have defined your Gym wrappers (classes) as discussed above, it works like this:

from ray.tune.registry import register_env
from your_file import CustomEnv             # import your custom class

def env_creator(env_config):
    # wrap and return an instance of your custom class
    return ObservationWrapper(RewardWrapper(ActionWrapper(CustomEnv())))

# Choose a name and register your custom environment
register_env('WrappedCustomEnv-v0', env_creator)
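
If you end up with several wrapper combinations, the creator function can be generated rather than written by hand. The following is only a sketch under a few assumptions: make_env_creator is a hypothetical helper (not part of Ray or Gym), the environment class takes no constructor arguments as in the snippet above, and the env_config that RLlib passes to the creator is ignored.

```python
def make_env_creator(env_cls, wrappers):
    """Build a creator function suitable for passing to register_env.

    Wrappers are applied in list order, so the first entry ends up innermost.
    This helper is a hypothetical convenience, not a Ray or Gym API.
    """
    def env_creator(env_config):
        # RLlib calls the creator with the "env_config" dict from the config.
        # Assumption: env_cls needs no arguments, so env_config is unused here.
        env = env_cls()
        for wrapper in wrappers:
            env = wrapper(env)
        return env

    return env_creator


# Hypothetical usage with the classes from this thread (requires Ray):
# from ray.tune.registry import register_env
# register_env(
#     "WrappedCustomEnv-v0",
#     make_env_creator(CustomEnv, [ActionWrapper, RewardWrapper, ObservationWrapper]),
# )
```

The list order [ActionWrapper, RewardWrapper, ObservationWrapper] reproduces the nesting ObservationWrapper(RewardWrapper(ActionWrapper(CustomEnv()))). Calling the creator once with an empty dict before starting Tune is a cheap way to confirm that it returns the wrapped environment you expect.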

Now, in tune.run(), you can submit the name of the registered environment as you would for any other registered Gym environment:

import ray
from ray import tune

tune.run(
        "PPO",                          # 'PPO' algorithm (for example)
        config={"env": "WrappedCustomEnv-v0", # the registered instance
            #other options here as desired
            },
        # other options here as desired
        )

tune.run() then works as expected. Problem solved!
