I am trying to implement PPO from Stable Baselines3 for my custom environment, and I don't understand some commands
I don't understand the following:
env.close()
# What does this mean?
model.learn(total_timesteps=1000)
# Is total_timesteps here the number of steps after which the neural network model parameters are updated (i.e. the number of time-steps per episode)?
model = PPO(MlpPolicy, env, verbose=1)
# What is the meaning of verbose=1 here?
action, _state = model.predict(obs, deterministic=True)
# What is deterministic=True doing here? Does deterministic=True mean that the policy is deterministic and not stochastic?
Where can I state the number of episodes for which I want to run my experiment?
for i in range(1000):
    action, _states = model.predict(obs)
    obs, rewards, dones, info = env.step(action)
    env.render()
Is 1000 here the number of episodes?
Could someone please clarify these?
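env.close() releases whatever resources the environment holds, such as a rendering window. What it does depends on the environment; for example, a pygame-rendered environment may implement it like this:
def close(self):
    if self.screen is not None:
        import pygame
        pygame.display.quit()
        pygame.quit()
        self.isopen = False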
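
As a minimal sketch of how close() is typically used, the snippet below wraps an environment in contextlib.closing so that close() runs even if the loop raises. WindowedEnv is a hypothetical stand-in written for this example, not a real Gym or Stable Baselines3 class; it only mimics the isopen flag seen in the close() method above.

```python
from contextlib import closing


class WindowedEnv:
    """Hypothetical environment that holds a resource until close() is called."""

    def __init__(self):
        self.isopen = True

    def step(self, action):
        # Refuse to step once the resource has been released.
        if not self.isopen:
            raise RuntimeError("environment is closed")
        return 0, 0.0, False, {}

    def close(self):
        # Release the (pretend) window or other resource.
        self.isopen = False


env = WindowedEnv()
with closing(env):  # closing() calls env.close() when the block exits
    env.step(0)

print(env.isopen)  # False
```

With a real environment the same pattern applies: call env.close() once you are done stepping and rendering.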
total_timesteps is the total number of timesteps you'd like to train your agent for. n_steps is the parameter that decides how often the model is updated. Feel free to look at the documentation if you're still confused.
On verbose, the documentation says: verbose (int) – the verbosity level: 0 no output, 1 info, 2 debug
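
To make the difference between total_timesteps and n_steps concrete, here is a small arithmetic sketch. It assumes the on-policy training loop collects n_steps * n_envs transitions before every update and keeps going until at least total_timesteps have been collected, which is my understanding of how PPO in Stable Baselines3 behaves. rollout_plan is a hypothetical helper written for this example, and 2048 is, as far as I know, PPO's default n_steps.

```python
import math


def rollout_plan(total_timesteps, n_steps, n_envs=1):
    """Return (number_of_updates, timesteps_actually_collected).

    Assumes one update phase per rollout of n_steps * n_envs transitions.
    """
    steps_per_rollout = n_steps * n_envs
    updates = math.ceil(total_timesteps / steps_per_rollout)
    return updates, updates * steps_per_rollout


print(rollout_plan(1000, 2048))    # (1, 2048)
print(rollout_plan(10_000, 2048))  # (5, 10240)
```

Under that assumption, total_timesteps=1000 with n_steps=2048 still gives a single update after one full rollout, so neither number is "timesteps per episode".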
deterministic is an argument to the predict() function, so this question doesn't make sense: you can just as easily call predict() immediately afterward with deterministic=False and you would get a stochastic action. The model itself is neither stochastic nor deterministic. Conceptually, for a discrete action space, the policy network outputs logits that define a probability distribution over actions, and an action is drawn from that distribution:
logits = policy_network(observation)
probs = Categorical(logits=logits)
actions = probs.sample()
As you can see from this code:
def get_actions(self, deterministic: bool = False) -> th.Tensor:
    """
    Return actions according to the probability distribution.
    :param deterministic:
    :return:
    """
    if deterministic:
        return self.mode()
    return self.sample()
Stable Baselines uses the deterministic input you mentioned to call either the Categorical distribution's mode() function or its sample() function. The code for both is in PyTorch's documentation:
def sample(self, sample_shape=torch.Size()):
    if not isinstance(sample_shape, torch.Size):
        sample_shape = torch.Size(sample_shape)
    probs_2d = self.probs.reshape(-1, self._num_events)
    samples_2d = torch.multinomial(probs_2d, sample_shape.numel(), True).T
    return samples_2d.reshape(self._extended_shape(sample_shape))
def mode(self):
    return self.probs.argmax(axis=-1)
As you can see, the Categorical distribution's sample() function just calls PyTorch's torch.multinomial function, which returns a random sample from the multinomial distribution; this is what makes your actions stochastic when deterministic=False. On the other hand, the Categorical distribution's mode() function just performs an argmax() operation, which has no randomness and is therefore deterministic. Hopefully that explanation was not too complicated.
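
The same contrast can be sketched without PyTorch. The snippet below is a standard-library illustration only, not the Stable Baselines3 or PyTorch implementation: softmax, mode, sample and get_action are hypothetical helpers written for this example, and the logits are made-up numbers.

```python
import math
import random


def softmax(logits):
    """Turn raw logits into probabilities that sum to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mode(probs):
    """Argmax: always the index of the most probable action."""
    return max(range(len(probs)), key=probs.__getitem__)


def sample(probs, rng):
    """Draw one action index at random, weighted by probs."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def get_action(logits, deterministic=False, rng=random):
    probs = softmax(logits)
    return mode(probs) if deterministic else sample(probs, rng)


logits = [0.5, 2.0, 0.1]  # made-up policy output for three discrete actions
rng = random.Random(0)

greedy = [get_action(logits, deterministic=True) for _ in range(5)]
sampled = [get_action(logits, deterministic=False, rng=rng) for _ in range(1000)]

print(greedy)                # [1, 1, 1, 1, 1]
print(sorted(set(sampled)))  # the distinct actions that were drawn
```

The same logits produce the same action every time with deterministic=True, while deterministic=False draws different actions in proportion to their probabilities.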
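To limit training by the number of episodes rather than by timesteps, you can use the StopTrainingOnMaxEpisodes callback:
from stable_baselines3 import A2C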
from stable_baselines3.common.callbacks import StopTrainingOnMaxEpisodes
# Stops training when the model reaches the maximum number of episodes
callback_max_episodes = StopTrainingOnMaxEpisodes(max_episodes=5, verbose=1)
model = A2C('MlpPolicy', 'Pendulum-v1', verbose=1)
# Almost infinite number of timesteps, but the training will stop
# early as soon as the max number of episodes is reached
model.learn(int(1e10), callback=callback_max_episodes)
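
For the evaluation loop in the question, note that range(1000) counts environment steps, not episodes; an episode ends whenever the environment reports done. Below is a minimal sketch of looping over episodes instead. ToyEnv is a hypothetical stand-in environment using the same four-value step() return as the question, and the policy argument stands in for something like lambda obs: model.predict(obs, deterministic=True)[0].

```python
class ToyEnv:
    """Hypothetical environment: every episode lasts exactly episode_length steps."""

    def __init__(self, episode_length=10):
        self.episode_length = episode_length
        self.t = 0

    def reset(self):
        self.t = 0
        return 0  # dummy observation

    def step(self, action):
        self.t += 1
        done = self.t >= self.episode_length
        return self.t, 1.0, done, {}  # obs, reward, done, info


def run_episodes(env, policy, n_episodes):
    """Run exactly n_episodes episodes and return the total number of steps taken."""
    total_steps = 0
    for _ in range(n_episodes):
        obs = env.reset()
        done = False
        while not done:  # one pass through this loop is one timestep
            action = policy(obs)
            obs, reward, done, info = env.step(action)
            total_steps += 1
    return total_steps


steps = run_episodes(ToyEnv(), policy=lambda obs: 0, n_episodes=3)
print(steps)  # 30
```

The outer loop fixes the number of episodes, while the number of timesteps depends on how long each episode happens to last.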
If you're confused at all about how PPO works or want more questions answered about that, I highly, highly, HIGHLY suggest reading this article, which fully explains how PPO is actually implemented in code, links all the papers that explain the intuition behind how PPO was created, and comes with extremely helpful videos.