
Does the policy gradient algorithm come under model-free or model-based methods in Reinforcement Learning?

Reinforcement learning algorithms that explicitly learn system models and use them to solve MDP problems are model-based methods. Model-based RL has a strong influence from control theory and is often explained in terms of different disciplines. These methods include popular algorithms such as Dyna [Sutton 1991], Q-iteration [Busoniu et al. 2010], Policy Gradient (PG) [Williams 1992], etc.

Model-free methods ignore the model and focus on figuring out the value functions directly from interaction with the environment. To accomplish this, these methods depend heavily on sampling and observation; thus they don't need to know the inner workings of the system. Some examples of these methods are Q-learning [Krose 1995], SARSA [Rummery and Niranjan 1994], and Actor-Critic [Konda and Tsitsiklis 1999].
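To make that concrete, here is a minimal tabular Q-learning sketch in Python. The toy `step` dynamics are hypothetical, invented only for this example; the point is that the agent improves its value estimates purely from sampled transitions and never inspects the environment's model.

```python
import numpy as np

# Minimal tabular Q-learning sketch on a hypothetical 5-state, 2-action environment.
n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.99, 0.1

def step(state, action):
    # Stand-in for a real environment: the agent only *samples* from it
    # and never reads these dynamics directly.
    next_state = (state + 1) % n_states if action == 1 else state
    reward = 1.0 if next_state == n_states - 1 else 0.0
    return reward, next_state

state = 0
for _ in range(10_000):
    # epsilon-greedy action selection
    if np.random.rand() < epsilon:
        action = np.random.randint(n_actions)
    else:
        action = int(np.argmax(Q[state]))
    reward, next_state = step(state, action)
    # Model-free TD update: bootstrap from the sampled next state only.
    Q[state, action] += alpha * (reward + gamma * Q[next_state].max() - Q[state, action])
    state = next_state
```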

Elsewhere it is written that policy gradient methods are model-free. This is confusing; can someone clear it up, given that actor-critic is also part of the policy gradient family of algorithms?

Policy Gradient algorithms are model-free.

In model-based algorithms, the agent has access to, or learns, the environment's transition function: F(state, action) = (reward, next_state). The transition function here can be either deterministic or stochastic.

In other words, in model-based algorithms, the agent predicts what is going to happen in the environment if a particular action is taken (such as in this paper: Model-Based Reinforcement Learning for Atari). Alternatively, the agent has access to the transition function according to the framing of the problem (for example, in AlphaGo, the agent has access to the transition function of the Go board).
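To make the interface concrete, here is a hedged sketch of what such a transition function might look like. The function names and toy dynamics are illustrative, not taken from any particular library:

```python
import numpy as np

def deterministic_model(state, action):
    # F(state, action) -> (reward, next_state): always the same outcome.
    next_state = state + action
    reward = -abs(next_state)
    return reward, next_state

def stochastic_model(state, action, noise_std=0.1):
    # Same interface, but the next state is sampled from a distribution.
    next_state = state + action + np.random.normal(0.0, noise_std)
    reward = -abs(next_state)
    return reward, next_state

# A model-based agent can "imagine" rollouts with the model
# instead of touching the real environment:
state = 0.0
for action in [1.0, -0.5, 0.25]:
    reward, state = deterministic_model(state, action)
```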

In policy gradient algorithms, the agent has a policy network for predicting what action to take and a value network for predicting the value of the current state. Neither of these networks predicts the environment's transition function. Therefore, it's considered model-free.
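For illustration, a minimal REINFORCE-with-baseline update in PyTorch might look like the sketch below. The network sizes, the `update` helper, and the dummy rollout data are assumptions made for this example, not any paper's reference implementation. Note that neither network outputs a reward or a next state; nothing here models the environment's dynamics, which is exactly why the method counts as model-free.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

obs_dim, n_actions = 4, 2  # arbitrary sizes for this sketch

# Policy network: maps observation to action logits.
policy_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))
# Value network: maps observation to a scalar state-value estimate.
value_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(
    list(policy_net.parameters()) + list(value_net.parameters()), lr=1e-3
)

def update(states, actions, returns):
    # states: (T, obs_dim), actions: (T,), returns: (T,) discounted returns.
    dist = Categorical(logits=policy_net(states))
    values = value_net(states).squeeze(-1)
    advantages = returns - values.detach()  # value baseline reduces variance
    policy_loss = -(dist.log_prob(actions) * advantages).mean()
    value_loss = (returns - values).pow(2).mean()
    optimizer.zero_grad()
    (policy_loss + value_loss).backward()
    optimizer.step()

# Example call with dummy rollout data:
T = 8
update(torch.randn(T, obs_dim), torch.randint(0, n_actions, (T,)), torch.randn(T))
```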

You might also find OpenAI Spinning Up's taxonomy diagram helpful.
