
What is the best approach to tackle this kind of DQN Agent?

I am a beginner at Reinforcement Learning and Deep Learning, so bear with me ^^

Let's say we have a DQN agent in Keras whose input is a 2D matrix of 0s and 1s with 10 rows and 3 columns.

This matrix represents the requests of 10 users (one per row); if a column's value in a row equals 1, that means the user is asking the agent for a resource.

Example:

[
 [0, 1, 0],
 [0, 0, 0],
 [1, 0, 0],
 [0, 0, 1],
 ...
]

Upon receiving the input matrix, the agent must give a resource to each user that asked for one, and nothing to the users who didn't.

Let's say the agent has 12 resources that it can allocate. We can represent the resource allocation as a 2D matrix with 12 rows (number of resources) and 10 columns (number of users).

Each resource can be given to one user only and each user can use one resource only in each step.

I have tried this, which tackles a similar problem to mine, but when I run the code the q_values (or weights?) are assigned to each column of each row of the output matrix, whereas what I want is for the q_values to be assigned to the matrix as a whole, or at least that's what my beginner brain tells me to do.

The action (output) matrix can be like this:

[
 [1, 0, 0, 0, 0, ...],
 [0, 0, 0, 0, 0, ...],
 [0, 0, 0, 1, 0, ...],
 ...
]
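To make the shapes concrete, here is a small sketch (with numpy, and hypothetical example values) of the state and action matrices and the constraints described above:

```python
import numpy as np

requests = np.zeros((10, 3), dtype=int)     # state: 10 users x 3 columns
requests[0, 1] = 1                          # user 0 asks for a resource

allocation = np.zeros((12, 10), dtype=int)  # action: 12 resources x 10 users
allocation[0, 0] = 1                        # resource 0 given to user 0
allocation[2, 3] = 1                        # resource 2 given to user 3

# Constraints: each resource goes to at most one user,
# and each user gets at most one resource per step.
assert allocation.sum(axis=1).max() <= 1    # per-resource constraint
assert allocation.sum(axis=0).max() <= 1    # per-user constraint
```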

One idea I had is to choose from a collection of matrices (actions), but the collection is very large and I cannot store it because it gives me a MemoryError.
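For context, a quick back-of-the-envelope count (a sketch, assuming 10 users and 12 resources as above) shows why enumerating every allocation matrix fails: each valid matrix is a partial matching, so we choose which k users are served and then assign them k distinct resources.

```python
from math import comb, perm

n_users, n_resources = 10, 12

# Choose k users to serve (C(n_users, k) ways), then assign them
# k distinct resources in order (P(n_resources, k) ways),
# summed over all possible k.
total = sum(comb(n_users, k) * perm(n_resources, k)
            for k in range(n_users + 1))
print(total)  # on the order of billions of distinct action matrices
```

With over a billion candidate matrices, storing the collection up front is hopeless, which matches the MemoryError.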

I am still confused as to what the best approach to solve this dilemma is.

The simplest way to do this would be to define your DQN agent with an n_users-dimensional action vector. Each entry of this action vector should be an integer x in [-1, n_resources). x == -1 means no resource is assigned to this user, whilst 0 <= x < n_resources means the xth resource is assigned to this user. Your example action output would therefore be represented as:

[0, -1, 3, ...]
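As a sketch (the function name and sizes are mine, not from any library), decoding such a vector back into the 12 × 10 allocation matrix could look like:

```python
import numpy as np

N_RESOURCES, N_USERS = 12, 10

def decode_action(action_vec):
    """Turn a per-user resource-index vector (-1 = no resource)
    into a (n_resources, n_users) 0/1 allocation matrix."""
    matrix = np.zeros((N_RESOURCES, N_USERS), dtype=int)
    for user, res in enumerate(action_vec):
        if res >= 0:
            matrix[res, user] = 1
    return matrix

m = decode_action([0, -1, 3])
# Resource 0 -> user 0, nothing for user 1, resource 3 -> user 2.
```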

If the agent tries to assign the same resource to two users, you mark that as an illegal action. The problem with this is that your space of illegal actions is huge (factorial in the number of users).
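The legality check itself is cheap, though; a sketch (hypothetical helper name):

```python
def is_legal(action_vec):
    """An action vector is legal iff no resource index other than -1
    appears more than once, i.e. no resource serves two users."""
    assigned = [r for r in action_vec if r >= 0]
    return len(assigned) == len(set(assigned))
```

For example, `is_legal([0, -1, 3])` passes, while `is_legal([0, 0, -1])` fails because resource 0 would serve two users.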

Another approach would be to completely change your problem architecture and have an agent that assigns resources to the users one at a time. The agent would obviously need some sort of memory for the resources it has already allocated. This way your action and illegal-action structures are much simpler. In this scenario, an episode would consist of n_users time-steps, where at each time-step the agent interacts with the environment, seeing the request of the current user and the resources it has already allocated.
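A minimal sketch of that sequential formulation (all names are mine, and a random policy stands in for the trained DQN, just to show the episode loop and the "memory" of allocated resources):

```python
import random

N_USERS, N_RESOURCES = 10, 12

# 1 = the user wants a resource this episode (hypothetical sample data).
requests = [random.randint(0, 1) for _ in range(N_USERS)]

free = set(range(N_RESOURCES))  # the agent's memory of unallocated resources
allocation = {}                 # user -> resource

for user in range(N_USERS):
    # State seen by the agent at this step: (current user's request,
    # free-resource mask). The legal actions are just "allocate one of
    # the free resources" or "allocate nothing".
    if requests[user] == 1 and free:
        choice = random.choice(sorted(free))  # agent's action
        allocation[user] = choice
        free.remove(choice)
    # else: "allocate nothing" is the only sensible/legal move

# Each resource is used at most once by construction.
assert len(set(allocation.values())) == len(allocation)
```

Because the free-resource mask shrinks as the episode progresses, illegal actions can be masked out directly instead of being penalized, which is far simpler than policing a full 12 × 10 matrix at once.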
