How to deal with different state space size in reinforcement learning?
I'm working on A2C reinforcement learning in an environment where the number of agents increases and decreases over time. As a result, the size of the state space changes as well. I have tried to handle the changing state space as follows:
If the state is larger than the maximum size, given by n_input, n_input entries are sampled from it with np.random.choice, after converting the state values into probabilities.
If the state is smaller than the maximum size, I pad it with zeros.
def get_state_new(state):
    n_features = n_input-len(get_state(env))
    # print("state",len(get_state(env)))
    p = np.array(state)
    p = np.exp(p)
    if p.sum() != 1.0:
        p = p * (1. / p.sum())
    if len(get_state(env)) > n_input:
        statappend = np.random.choice(state, size=n_input, p=p)
        # print(statappend)
    else:
        statappend = np.zeros(n_input)
        statappend[:state.shape[0]] = state
    return statappend
It works, but the results are not as expected, and I don't know whether this approach is correct.
My question
Are there any reference papers that deal with this problem, and how should a changing state space be handled?
For the paper, I'm going to give the same reference as in the other post: Benchmarks for reinforcement learning in mixed-autonomy traffic.
In this approach, an expected number of agents (the number expected to be present in the simulation at any moment in time) is indeed predetermined. During runtime, the observations of the agents present in the simulation are retrieved and squashed into a container (tensor) of fixed size (let's call it the overall observation container), which can hold as many individual agents' observations as there are agents expected to be present at any moment in time. Just to be clear: size(overall observation container) = expected number of agents * individual observation size. Since the actual number of agents present in a simulation may vary from time step to time step, the following applies: if fewer agents than expected are present, the unused part of the container is padded with zeros; if more agents than expected are present, only a random subset of their observations is squashed into the container.
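As a rough illustration of the size relation above, here is a minimal sketch. The helper names container_size and to_network_input, the value of five expected agents and the 2x2 observation shape are assumptions made for this example; they do not come from the paper.

```python
import numpy as np

def container_size(expected_agents, observation_shape):
    """Number of scalar entries in the fixed-size overall observation container."""
    return expected_agents * int(np.prod(observation_shape))

def to_network_input(container):
    """Flatten a [expected_agents, ...] container into a 1-D input vector."""
    return np.asarray(container, dtype=float).reshape(-1)

expected_agents = 5         # assumed value, for illustration only
observation_shape = (2, 2)  # assumed shape of one agent's observation

container = np.zeros((expected_agents, *observation_shape))
container[:3] = 1.0         # pretend only three agents are currently present

print(container_size(expected_agents, observation_shape))  # 20
print(to_network_input(container).shape)                   # (20,)
```

The flattened length stays the same no matter how many agents are actually present, which is what lets the network keep a fixed input size.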
Coming back to your sample code, there are a few things I would do differently.
First, I was wondering why you have both the variable state (passed to the function get_state_new) and the call get_state(env), since I would expect the information returned by get_state(env) to be the same as what is already stored in the variable state. As a tip, it would make the code a bit nicer to read if you used only the state variable (provided the variable and the function call indeed give the same information).
The second thing I would do differently is how you process states: p = np.exp(p), p = p * (1. / p.sum()). This normalizes the overall observation container by the sum of all exponentiated values present in all individual observations. In contrast, I would normalize each individual observation in isolation.
The reason is as follows: if you provide a small number of observations, the sum of the exponentiated values contained in all individual observations can be expected to be smaller than the sum taken over a larger number of individual observations. These differences in the sum, which is then used for normalization, will result in different magnitudes of the normalized values (roughly speaking, as a function of the number of individual observations). Consider the following example:
import numpy as np
# Fewer state representations
state = np.array([1,1,1])
state = state/state.sum()
state
# Output: array([0.33333333, 0.33333333, 0.33333333])
# More state representations
state = np.array([1,1,1,1,1])
state = state/state.sum()
state
# Output: array([0.2, 0.2, 0.2, 0.2, 0.2])
The same input state representation, as obtained by an individual agent, should always result in the same output state representation after normalization, regardless of the number of agents currently present in the simulation. So please make sure to normalize each observation on its own. I'll give an example below.
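To make the point concrete, here is a minimal sketch of per-observation normalization. The helper name normalize_each and the choice of scaling by each observation's maximum absolute value are assumptions for illustration; any scheme that uses only the agent's own observation would show the same effect.

```python
import numpy as np

def normalize_each(observations):
    """Scale every agent's observation by its own maximum absolute value."""
    observations = np.asarray(observations, dtype=float)
    normalized = np.zeros_like(observations)
    for i, obs in enumerate(observations):
        scale = np.abs(obs).max()
        # All-zero observations are left as they are to avoid dividing by zero
        normalized[i] = obs / scale if scale > 0 else obs
    return normalized

# The first agent reports the same observation in both situations
few = np.array([[1.0, 2.0], [2.0, 4.0]])
many = np.vstack([few, [[10.0, 20.0], [3.0, 3.0], [0.0, 0.0]]])

# Its normalized observation does not depend on how many agents are present
print(np.allclose(normalize_each(few)[0], normalize_each(many)[0]))  # True
```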
Also, please make sure to keep track of which agents' observations (and in which order) have been squashed into your variable statappend. This is important for the following reason.
If there are agents A1 through A5, but the overall observation container can take only three observations, three out of the five state representations are going to be selected at random. Say the observations randomly selected to be squashed into the overall observation container stem from the following agents in the following order: A2, A5, A1. Then these agents' observations will be squashed into the overall observation container in exactly this order: first the observation of A2, then that of A5, and finally that of A1. Correspondingly, given this overall observation container, the three actions predicted by your reinforcement learning controller will correspond to agents A2, A5, and A1 (in order!), respectively. In other words, the order of the agents on the input side also dictates which agents the predicted actions correspond to on the output side.
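The bookkeeping described above can be sketched as follows. The helper name assign_actions and the action values are made up for this example, and -1 is assumed to mark an empty (padded) slot.

```python
def assign_actions(actions, order_agents):
    """Pair the i-th predicted action with the agent occupying slot i.

    Slots marked with -1 (agent absent) are skipped.
    """
    return {int(agent): action
            for agent, action in zip(order_agents, actions)
            if int(agent) >= 0}

# The container slots were filled from agents A2, A5, A1 (in that order)
order_agents = [2, 5, 1]
actions = [0.3, -0.1, 0.7]  # made-up controller outputs, one per slot

print(assign_actions(actions, order_agents))  # {2: 0.3, 5: -0.1, 1: 0.7}
```

Without the stored order, there would be no way to tell that the first predicted action belongs to A2 rather than A1.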
I would propose something like the following:
import numpy as np

def get_overall_observation(observations, expected_observations=5):
    # Return values:
    #   observations: overall observation container holding expected_observations entries
    #   order_agents: The returned observations stem from this ordered set of agents (in sequence)

    # Work on a float copy so that the caller's array is not modified in place
    observations = np.array(observations, dtype=float)

    # Get some info
    n_observations = observations.shape[0]  # Actual nr of observations
    observation_size = list(observations.shape[1:])  # Shape of an agent's individual observation

    # Normalize individual observations
    for i in range(n_observations):
        max_value = observations[i].max()
        if max_value != 0:  # Avoid division by zero; such observations are left unscaled
            observations[i] = observations[i] / max_value

    if n_observations == expected_observations:
        # Return (normalized) observations as they are & sequence of agents in order (i.e. no randomization)
        order_agents = np.arange(n_observations)
        return observations, order_agents

    if n_observations < expected_observations:
        # Return padded observations as they are & padded sequence of agents in order (i.e. no randomization)
        padded_observations = np.zeros([expected_observations] + observation_size)
        padded_observations[0:n_observations] = observations
        order_agents = list(range(n_observations)) + [-1] * (expected_observations - n_observations)  # -1 == agent absent
        return padded_observations, order_agents

    if n_observations > expected_observations:
        # Return random selection of observations in random order
        order_agents = np.random.choice(range(n_observations), size=expected_observations, replace=False)
        selected_observations = np.zeros([expected_observations] + observation_size)
        for i_selected, i_given_observations in enumerate(order_agents):
            selected_observations[i_selected] = observations[i_given_observations]
        return selected_observations, order_agents

# Example usage
n_observations = 5 # Number of actual observations
width = height = 2 # Observation dimension
state = np.random.random(size=[n_observations,height,width]) # Random state
print(state)
print(get_overall_observation(state))
I tried several solutions to this problem, and I found that encoding is the best solution in my case.
As mentioned in [1], the extra connected autonomous vehicles (CAVs) are not included in the state, and if there are fewer CAVs than the maximum, the state is padded with zeros. We can select how many agents share their state, which is added to the agent's own state. For the encoder, I use the Neural machine translation with attention code:
import tensorflow as tf

class Encoder(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim, enc_units, batch_sz):
        super(Encoder, self).__init__()
        self.batch_sz = batch_sz
        self.enc_units = enc_units
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = tf.keras.layers.GRU(self.enc_units,
                                       return_sequences=True,
                                       return_state=True,
                                       recurrent_initializer='glorot_uniform')

    def call(self, x, hidden):
        x = self.embedding(x)
        output, state = self.gru(x, initial_state=hidden)
        return output, state

    def initialize_hidden_state(self):
        return tf.zeros((self.batch_sz, self.enc_units))

1- Vinitsky, E., Kreidieh, A., Le Flem, L., Kheterpal, N., Jang, K., Wu, C., ... & Bayen, A. M. (2018, October). Benchmarks for reinforcement learning in mixed-autonomy traffic. In Conference on Robot Learning (pp. 399-409).
2- Kochkina, E., Liakata, M., & Augenstein, I. (2017). Turing at SemEval-2017 Task 8: Sequential approach to rumour stance classification with branch-LSTM. arXiv preprint arXiv:1704.07221.
3- Ma, L., & Liang, L. (2020). Enhance CNN Robustness Against Noises for Classification of 12-Lead ECG with Variable Length. arXiv preprint arXiv:2008.03609.
4- How to feed LSTM with different input array sizes?
5- Zhao, X., Xia, L., Zhang, L., Ding, Z., Yin, D., & Tang, J. (2018, September). Deep reinforcement learning for page-wise recommendations. In Proceedings of the 12th ACM Conference on Recommender Systems (pp. 95-103).