
How to deal with different state space size in reinforcement learning?

I'm working with A2C reinforcement learning, where the number of agents in my environment increases and decreases over time. As the number of agents changes, the state space also changes. I have tried to handle the changing state space this way:

  • If the state space exceeds the maximum size selected as n_input, the excess is handled with np.random.choice, which draws random samples from the state after converting it into probabilities.

  • If the state space is smaller than the maximum, I pad the state with zeros.

def get_state_new(state):
    n_features = n_input - len(get_state(env))
    # print("state", len(get_state(env)))
    p = np.array(state)
    p = np.exp(p)
    if p.sum() != 1.0:
        p = p * (1. / p.sum())
    if len(get_state(env)) > n_input:
        statappend = np.random.choice(state, size=n_input, p=p)
        # print(statappend)
    else:
        statappend = np.zeros(n_input)
        statappend[:state.shape[0]] = state
    return statappend

It works, but the results are not as expected, and I don't know whether this approach is correct or not.

My question

Are there any reference papers that deal with such a problem, i.e. with handling a changing state space?

For the paper, I'm going to give the same reference as in the other post: Benchmarks for reinforcement learning in mixed-autonomy traffic.

In this approach, an expected number of agents (the number expected to be present in the simulation at any moment in time) is indeed predetermined. During runtime, the observations of the agents present in the simulation are retrieved and squashed into a container (tensor) of fixed size (let's call it the overall observation container), which can hold as many individual observations as there are agents expected to be present at any moment in the simulation. Just to be clear: size(overall observation container) = expected number of agents * individual observation size. Since the actual number of agents present in a simulation may vary from time step to time step, the following applies:

  • If fewer agents than expected are present in the environment, and hence fewer observations are provided than fit into the overall observation container, zero-padding is used to fill the empty observation slots.
  • If the number of agents exceeds the expected number, only a subset of the observations provided will be used. So, only the observations of a randomly selected subset of the available agents are put into the overall observation container of fixed size. Only for the chosen agents will the controller compute actions to be performed, while the "excess agents" have to be treated as non-controlled agents in the simulation.

Coming back to your sample code, there are a few things I would do differently.

First, I was wondering why you have both the variable state (passed to the function get_state_new) and the call get_state(env), since I would expect the information returned by get_state(env) to be the same as that already stored in the variable state. As a tip, the code would be a bit nicer to read if you used the state variable only (provided the variable and the function call indeed return the same information).

The second thing I would do differently is how you process states: p = np.exp(p), p = p * (1. / p.sum()). This normalizes the overall observation container by the sum of all exponentiated values present in all individual observations. In contrast, I would normalize each individual observation in isolation.

The reason is the following: if you provide a small number of observations, the sum of the exponentiated values contained in all individual observations can be expected to be smaller than when summing over the exponentiated values of a larger number of individual observations. These differences in the sum, which is then used for normalization, will result in different magnitudes of the normalized values (roughly speaking, as a function of the number of individual observations). Consider the following example:

import numpy as np

# Fewer state representations
state = np.array([1,1,1])
state = state/state.sum()
state
# Output: array([0.33333333, 0.33333333, 0.33333333])

# More state representations
state = np.array([1,1,1,1,1])
state = state/state.sum()
state
# Output: array([0.2, 0.2, 0.2, 0.2, 0.2])

Actually, the same input state representation, as obtained by an individual agent, should always result in the same output state representation after normalization, regardless of the number of agents currently present in the simulation. So, please make sure to normalize all observations on their own. I'll give an example below.

Also, please make sure to keep track of which agents' observations (and in which order) have been squashed into your variable statappend. This is important for the following reason.

If there are agents A1 through A5, but the overall observation container can take only three observations, three out of the five state representations are going to be selected at random. Say the observations randomly selected to be squashed into the overall observation container stem from the following agents, in the following order: A2, A5, A1. Then these agents' observations will be squashed into the overall observation container in exactly this order: first the observation of A2, then that of A5, and finally that of A1. Correspondingly, given this overall observation container, the three actions predicted by your reinforcement learning controller will correspond to agents A2, A5, and A1 (in that order!), respectively. In other words, the order of the agents on the input side also dictates which agents the predicted actions correspond to on the output side.

I would propose something like the following:

import numpy as np

def get_overall_observation(observations, expected_observations=5):
    # Return value:
    #   order_agents: The returned observations stem from this ordered set of agents (in sequence)

    # Get some info
    n_observations = observations.shape[0]  # Actual nr of observations
    observation_size = list(observations.shape[1:])  # Shape of an agent's individual observation

    # Normalize individual observations
    for i in range(n_observations):
        # TODO: handle possible 0-divisions
        observations[i,:] = observations[i,:] / observations[i,:].max()

    if n_observations == expected_observations:
        # Return (normalized) observations as they are & sequence of agents in order (i.e. no randomization)
        order_agents = np.arange(n_observations)
        return observations, order_agents
    if n_observations < expected_observations:
        # Return padded observations as they are & padded sequence of agents in order (i.e. no randomization)
        padded_observations = np.zeros([expected_observations]+observation_size)
        padded_observations[0:n_observations,:] = observations
        order_agents = list(range(n_observations))+[-1]*(expected_observations-n_observations) # -1 == agent absent
        return padded_observations, order_agents
    if n_observations > expected_observations:
        # Return random selection of observations in random order
        order_agents = np.random.choice(range(n_observations), size=expected_observations, replace=False)
        selected_observations = np.zeros([expected_observations] + observation_size)
        for i_selected, i_given_observations in enumerate(order_agents):
            selected_observations[i_selected,:] = observations[i_given_observations,:]
        return selected_observations, order_agents


# Example usage
n_observations = 5      # Number of actual observations
width = height =  2     # Observation dimension
state = np.random.random(size=[n_observations,height,width])  # Random state
print(state)
print(get_overall_observation(state))
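
To make the point about agent order concrete, here is a minimal sketch of how the returned order_agents could be used to route the controller's predictions back to the right agents. The controller and apply_action calls are hypothetical placeholders standing in for your A2C policy and environment interface:

overall_obs, order_agents = get_overall_observation(state)
actions = controller(overall_obs)               # hypothetical: one action per observation slot
for slot, agent_idx in enumerate(order_agents):
    if agent_idx == -1:
        continue                                # padded slot, no agent present
    apply_action(agent_idx, actions[slot])      # hypothetical environment call for that agent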

I solved the problem using different approaches, but I found that encoding is the best solution for my problem:

  • Select the model with a pre-estimated maximum state space, and if the state space is smaller than the maximum, pad the state space with zeros.
  • Consider only the agent's own state, without sharing any of the other agents' states.
  • As paper [1] mentions, the extra connected autonomous vehicles (CAVs) are not included in the state, and if there are fewer than the maximum number of CAVs, the state is padded with zeros. We can choose how many agents' states to share and add to the agent's own state.
  • Encode the state, which helps us process the input and compress the information into a fixed length. In the encoder, every cell in the LSTM layer, or in an RNN with gated recurrent units (GRU), returns a hidden state (Ht) and a cell state (E't).


For the encoder, I use the Neural machine translation with attention code (a short usage sketch follows the class below):

class Encoder(tf.keras.Model):
  def __init__(self, vocab_size, embedding_dim, enc_units, batch_sz):
    super(Encoder, self).__init__()
    self.batch_sz = batch_sz
    self.enc_units = enc_units
    self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
    self.gru = tf.keras.layers.GRU(self.enc_units,
                                   return_sequences=True,
                                   return_state=True,
                                   recurrent_initializer='glorot_uniform')

  def call(self, x, hidden):
    x = self.embedding(x)
    output, state = self.gru(x, initial_state = hidden)
    return output, state

  def initialize_hidden_state(self):
    return tf.zeros((self.batch_sz, self.enc_units))
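
A rough usage sketch of this encoder, assuming the agents' states have been integer-encoded so that the Embedding layer applies; the hyperparameter values below are made up for illustration:

import tensorflow as tf

vocab_size, embedding_dim, enc_units, batch_sz = 100, 16, 32, 1   # made-up hyperparameters
encoder = Encoder(vocab_size, embedding_dim, enc_units, batch_sz)

agent_tokens = tf.constant([[3, 7, 42]])        # batch of 1, three agents currently present
hidden = encoder.initialize_hidden_state()
output, state = encoder(agent_tokens, hidden)   # state has fixed shape (batch_sz, enc_units)
print(state.shape)                              # (1, 32), regardless of how many agents there are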
  • LSTM zero padding and masking, where we pad the state with a special value that is masked (skipped) later. If we pad without masking, the padded values are treated as actual values and thus become noise in the state [2-4] (see the sketch below).
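
A minimal sketch of the zero-padding-plus-masking idea in Keras, assuming (hypothetically) that each agent contributes a fixed-size feature vector and that 0.0 is reserved as the padding value:

import numpy as np
import tensorflow as tf

max_agents, obs_dim = 5, 4                      # made-up sizes for illustration
model = tf.keras.Sequential([
    tf.keras.layers.Masking(mask_value=0.0, input_shape=(max_agents, obs_dim)),
    tf.keras.layers.LSTM(16),                   # masked (padded) time steps are skipped
    tf.keras.layers.Dense(2),                   # e.g. action logits or a value head
])

obs = np.random.random((1, max_agents, obs_dim)).astype("float32")
obs[0, 3:, :] = 0.0                             # only 3 of 5 agents present; the rest is padding
print(model(obs).shape)                         # (1, 2)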

1- Vinitsky, E., Kreidieh, A., Le Flem, L., Kheterpal, N., Jang, K., Wu, C., ... & Bayen, A. M. (2018, October). Benchmarks for reinforcement learning in mixed-autonomy traffic. In Conference on Robot Learning (pp. 399-409).

2- Kochkina, E., Liakata, M., & Augenstein, I. (2017). Turing at SemEval-2017 Task 8: Sequential approach to rumour stance classification with branch-LSTM. arXiv preprint arXiv:1704.07221.

3- Ma, L., & Liang, L. (2020). Enhance CNN Robustness Against Noises for Classification of 12-Lead ECG with Variable Length. arXiv preprint arXiv:2008.03609.

4- How to feed LSTM with different input array sizes?

5- Zhao, X., Xia, L., Zhang, L., Ding, Z., Yin, D., & Tang, J. (2018, September). Deep reinforcement learning for page-wise recommendations. In Proceedings of the 12th ACM Conference on Recommender Systems (pp. 95-103).
