
What is the best approach to tackle this kind of DQN Agent?

I am a beginner at Reinforcement Learning and Deep Learning, so bear with me ^^

Let's say we have a DQN agent in Keras that receives an input that is a 2D matrix of 0s and 1s, let's say it has 10 rows and 3 columns.

This matrix is a matrix of requests from 10 users (the number of rows): if one of the columns in a row is equal to 1, that means the corresponding user is asking the agent to be given a resource.

Example:

[
 [0, 1, 0],
 [0, 0, 0],
 [1, 0, 0],
 [0, 0, 1],
 ...
]

Upon receiving the input matrix, the agent must give a resource to each user who asked for one, and nothing to the users who didn't.

Let's say the agent has 12 resources that it can allocate. We can represent the resource allocation as a 2D matrix that has 12 rows (number of resources) and 10 columns (number of users).

Each resource can be given to one user only, and each user can use only one resource at each step.
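For concreteness, these two constraints can be checked on a candidate allocation matrix with a few lines of NumPy (the function name here is just illustrative):

    import numpy as np

    def is_valid_allocation(allocation: np.ndarray) -> bool:
        # allocation has shape (n_resources, n_users) with 0/1 entries.
        resource_ok = (allocation.sum(axis=1) <= 1).all()  # each resource: at most one user
        user_ok = (allocation.sum(axis=0) <= 1).all()      # each user: at most one resource
        return bool(resource_ok and user_ok)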

I have tried this, which is a similar problem to mine, but when I run the code, the q_values (or weights?) are assigned to each column of each row of the output matrix, while what I want is for the q_values to be assigned to the matrix as a whole, or at least that's what my beginner brain tells me to do.

The action (output) matrix can look like this:

[
 [1, 0, 0, 0, 0, ...],
 [0, 0, 0, 0, 0, ...],
 [0, 0, 0, 1, 0, ...],
 ...
]

One idea I had was to choose from a collection of matrices (actions), but the collection is very large and I cannot store it, because it gives me a MemoryError.
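To quantify why enumerating matrices fails: a rough count of the distinct legal allocation matrices for 10 users and 12 resources (assuming each user gets at most one resource, as described above) already comes out to about 2.6 billion:

    from math import comb, perm

    n_users, n_resources = 10, 12
    # Choose which k users are served, then assign them k distinct resources in order.
    total = sum(comb(n_users, k) * perm(n_resources, k) for k in range(n_users + 1))
    print(total)  # 2581284541 -- about 2.6e9 matrices, far too many to store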

I am still confused as to what the best approach to solve this dilemma is.

The simplest way to do this would be to define your DQN agent with an n_users-dimensional action vector. Each entry of this action vector should be an integer x in [-1, n_resources). x == -1 means no resource is assigned to this user, whilst 0 <= x < n_resources means the x-th resource is assigned to this user. Your example action output would therefore be represented as:

[0, -1, 3, ...]
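As a minimal sketch of how this encoding relates to your original matrix form (the helper name is illustrative, not part of any library):

    import numpy as np

    def action_to_allocation(action, n_resources=12):
        # Decode an n_users action vector (entries in [-1, n_resources))
        # into the (n_resources x n_users) 0/1 allocation matrix.
        allocation = np.zeros((n_resources, len(action)), dtype=int)
        for user, resource in enumerate(action):
            if resource >= 0:  # -1 means "no resource for this user"
                allocation[resource, user] = 1
        return allocation

    print(action_to_allocation([0, -1, 3]))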

If the agent tries to assign the same resource to two users, you mark that as an illegal action. The problem with this is that your space of illegal actions is huge (factorial in the number of users).
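Detecting such an illegal action is cheap, for example (sketch):

    def is_legal(action):
        # Illegal if the same resource index (>= 0) appears more than once.
        assigned = [r for r in action if r >= 0]
        return len(assigned) == len(set(assigned))

    print(is_legal([0, -1, 3]))  # True
    print(is_legal([0, 3, 3]))   # False: resource 3 given to two users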

Another approach would be to completely change your problem architecture and have an agent that assigns resources to the users one at a time. The agent would obviously need some sort of memory of the resources it has already allocated. This way your action and illegal-action structures are much simpler. In this scenario an episode would consist of n_users time steps, where at each time step the agent interacts with the environment and sees the request of the current user along with the resources it has already allocated.
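A minimal sketch of that episode structure might look like the following (the class and the reward scheme are purely illustrative; you would plug in your own reward shaping):

    import numpy as np

    class SequentialAllocationEnv:
        # One user per time step: the state is the current user's request row
        # plus a 0/1 mask of the resources already handed out.
        def __init__(self, requests, n_resources=12):
            self.requests = np.asarray(requests)  # shape (n_users, 3), 0/1 entries
            self.n_resources = n_resources

        def reset(self):
            self.user = 0
            self.allocated = np.zeros(self.n_resources, dtype=int)
            return self._state()

        def _state(self):
            return np.concatenate([self.requests[self.user], self.allocated])

        def step(self, action):
            # action is a single integer in [-1, n_resources).
            requested = bool(self.requests[self.user].any())
            if action >= 0 and self.allocated[action]:
                reward = -1.0  # illegal: resource already taken
            elif (action >= 0) == requested:
                reward = 1.0   # served a requester / skipped a non-requester
                if action >= 0:
                    self.allocated[action] = 1
            else:
                reward = -1.0  # denied a requester or served a non-requester
            self.user += 1
            done = self.user == len(self.requests)
            return (None if done else self._state()), reward, done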
