I am trying to implement PPO from stable-baselines3 for my custom environment, and I don't understand some commands

I don't understand the following:

  1. env.close() # what does this mean?

  2. model.learn(total_timesteps=1000) # is total_timesteps here the number of steps after which the neural network model's parameters are updated (i.e., the number of time steps per episode)?

  3. model = PPO(MlpPolicy, env, verbose=1) # what is the meaning of verbose=1 here?

  4. action, _state = model.predict(obs, deterministic=True) # what is deterministic=True doing here? Does deterministic=True mean that the policy is deterministic rather than stochastic?

  5. Where can I state the number of episodes for which I want to run my experiment?

    for i in range(1000):
        action, _states = model.predict(obs)
        obs, rewards, dones, info = env.step(action)
        env.render()

Is the 1000 here the number of episodes?

I would appreciate it if someone could clarify these.

  1. env.close() is dependent on the environment, so it will do different things for each one. It is basically used to stop rendering the game, as seen in the code here:
    def close(self):
        if self.screen is not None:
            import pygame

            pygame.display.quit()
            pygame.quit()
            self.isopen = False
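
A typical interaction loop (mirroring the one in the question, and assuming the classic Gym API in which step() returns four values) would therefore call env.close() once at the very end to shut the render window down cleanly; this is only a minimal sketch, not Stable Baselines code:

    obs = env.reset()
    for i in range(1000):
        action, _states = model.predict(obs)
        obs, rewards, dones, info = env.step(action)
        env.render()   # opens / refreshes the render window
    env.close()        # closes the window and frees the renderer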
  2. total_timesteps is the total number of timesteps you'd like to train your agent for. n_steps is the parameter that decides how often the model is updated (see the sketch after the next point). Feel free to look at the documentation if you're still confused.
  3. From the documentation, verbosity controls how much is printed out about the model at each update:
verbose (int) – the verbosity level: 0 no output, 1 info, 2 debug
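
To make the relationship between these parameters concrete, here is a minimal sketch; the CartPole-v1 environment and the specific numbers are illustrative assumptions, not anything from the question:

    from stable_baselines3 import PPO

    # n_steps: how many environment steps are collected (per environment) before each update
    # verbose=1: print training information at every update
    model = PPO("MlpPolicy", "CartPole-v1", n_steps=64, verbose=1)

    # total_timesteps: how many environment steps to train on in total; the model keeps
    # collecting 64-step rollouts (and updating after each one) until at least 1000 steps
    # have been gathered, regardless of how many episodes that spans
    model.learn(total_timesteps=1000)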
  4. No. As you are only calling the model's predict() function, the question doesn't quite make sense: you could call predict() again immediately afterward with deterministic=False and you would get a stochastic action. The model itself is neither stochastic nor deterministic. PPO, in particular, gets actions by first feeding an observation to the actor's network, which outputs what are called logits, or unnormalized log action probabilities. Those logits are then passed through a Categorical distribution, which is essentially a Softmax() operation, to get the action probability distribution. A simple pseudocode example of getting actions from the policy's network would be as follows:
    from torch.distributions import Categorical

    logits = policy_network(observation)   # actor network outputs unnormalized log action probabilities
    probs = Categorical(logits=logits)     # softmax over the logits -> action probability distribution
    actions = probs.sample()               # draw a (stochastic) action from that distribution

As you can see from this code:

    def get_actions(self, deterministic: bool = False) -> th.Tensor:
        """
        Return actions according to the probability distribution.
        :param deterministic:
        :return:
        """
        if deterministic:
            return self.mode()
        return self.sample()

Stable Baselines uses the deterministic input you mentioned to call either the Categorical distribution's mode() function or its sample() function. The code for both is in Pytorch's documentation:

    def sample(self, sample_shape=torch.Size()):
        if not isinstance(sample_shape, torch.Size):
            sample_shape = torch.Size(sample_shape)
        probs_2d = self.probs.reshape(-1, self._num_events)
        samples_2d = torch.multinomial(probs_2d, sample_shape.numel(), True).T
        return samples_2d.reshape(self._extended_shape(sample_shape))
    def mode(self):
        return self.probs.argmax(axis=-1)

As you can see, the Categorical distribution's sample() function just calls Pytorch's torch.multinomial, which returns a random sample from the multinomial distribution; that is what makes your actions stochastic when deterministic=False. On the other hand, the Categorical distribution's mode() function just performs an argmax() operation, which involves no randomness and is therefore deterministic. Hopefully that explanation was not too complicated.
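
If it helps, here is a short, self-contained sketch of that difference (the logits values below are made up purely for illustration):

    import torch
    from torch.distributions import Categorical

    logits = torch.tensor([1.0, 2.0, 0.5])   # hypothetical unnormalized log-probabilities for 3 actions
    dist = Categorical(logits=logits)

    stochastic = [dist.sample().item() for _ in range(5)]   # varies from run to run
    deterministic = torch.argmax(dist.probs).item()         # always index 1, the most likely action

    print(stochastic)      # e.g. [1, 1, 0, 2, 1]
    print(deterministic)   # 1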

  5. This isn't something that can be done simply by passing a parameter; you need to use the StopTrainingOnMaxEpisodes callback, as per the documentation. There is also a simple code example in the documentation, which I will echo here for clarity:
from stable_baselines3 import A2C
from stable_baselines3.common.callbacks import StopTrainingOnMaxEpisodes

# Stops training when the model reaches the maximum number of episodes
callback_max_episodes = StopTrainingOnMaxEpisodes(max_episodes=5, verbose=1)

model = A2C('MlpPolicy', 'Pendulum-v1', verbose=1)
# Almost infinite number of timesteps, but the training will stop
# early as soon as the max number of episodes is reached
model.learn(int(1e10), callback=callback_max_episodes)
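
As for the for i in range(1000) loop from the question: the 1000 there counts time steps, not episodes. If you only want to evaluate (rather than train) for a fixed number of episodes, you can count done flags yourself; below is a minimal sketch, assuming a single non-vectorized Gym environment where step() returns a single boolean done (with an SB3 VecEnv, dones is an array and resets happen automatically):

    n_episodes = 10                     # hypothetical number of evaluation episodes
    episodes_done = 0
    obs = env.reset()
    while episodes_done < n_episodes:
        action, _states = model.predict(obs, deterministic=True)
        obs, reward, done, info = env.step(action)
        env.render()
        if done:
            episodes_done += 1
            obs = env.reset()
    env.close()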

If you're at all confused about how PPO works or want more questions answered about it, I highly, highly suggest reading this article, which fully explains how PPO is actually implemented in code, links all of the papers that explain the intuition behind how PPO was created, and comes with extremely helpful videos.
