Q-table representation for nested lists as states and tuples as actions

How can I create a Q-table when my states are lists and my actions are tuples?

Example of states for N = 3

[[1], [2], [3]]
[[1], [2, 3]]
[[1], [3, 2]]
[[2], [3, 1]]
[[1, 2, 3]]

Example of actions for those states

[[1], [2], [3]] -> (1, 2), (1, 3), (2, 1), (2, 3), (3, 1), (3, 2)
[[1], [2, 3]] -> (1, 2), (2, 0), (2, 1)
[[1], [3, 2]] -> (1, 3), (3, 0), (3, 1)
[[2], [3, 1]] -> (2, 3), (3, 0), (3, 2)
[[1, 2, 3]] -> (1, 0)

I was wondering about something like this:

# q_table = {state: {action: q_value}}

But I don't think that's a good design.

1. Should your states really be of type list?

list is a mutable type. tuple is the equivalent immutable type. Do you mutate your states during learning? I doubt it.

In any case, if you use list, you cannot use it as a dictionary key (because it is mutable).
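For example, a small helper along these lines (the name to_key is just illustrative, not part of the original question) turns each nested-list state into a tuple of tuples, which is hashable and can therefore be used as a dictionary key:

def to_key(state):
    # [[1], [2, 3]] -> ((1,), (2, 3)): immutable, so usable as a dict key
    return tuple(tuple(group) for group in state)

q_table = {}
q_table[to_key([[1], [2, 3]])] = {(1, 2): 0.0, (2, 0): 0.0, (2, 1): 0.0}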

2. Otherwise this is a pretty good representation

In a reinforcement learning context, you'll want to:

  1. Get a specific value for Q
  2. Look at the Q values for all possible actions in a specific state (to find the maximal Q)

Your representation allows you to do both of these with minimal complexity, and is pretty clear. So it is a good representation.
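As a rough sketch (the names q, state, and action are assumed here, not taken from the question), both operations are one-liners with the nested-dict layout:

q_sa = q[state][action]                        # 1. the Q value of one state-action pair
best_action = max(q[state], key=q[state].get)  # 2. the action with the maximal Q in this state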

Using a nested dictionary is actually a reasonable design choice for custom tabular reinforcement learning; it's called tabular for a reason :)

You could use defaultdict to initialize the q-table to a certain value, e.g. 0:

from collections import defaultdict

default_q_value = 0.0  # initial Q value for unseen state-action pairs
q = defaultdict(lambda: defaultdict(lambda: default_q_value))

or without defaultdict:

# assumes states and actions enumerate every possible state and action
q = {s: {a: default_q_value for a in actions} for s in states}
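Either way, a lookup is simply q[state][action]. With the defaultdict variant the entry springs into existence on first access; with the plain dict every state and action has to be enumerated up front. A quick sketch, using a state and action from the example above:

state = ((1,), (2, 3))   # states stored as tuples of tuples so they are hashable
action = (2, 1)
print(q[state][action])  # default_q_value, e.g. 0.0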

It is then convenient to perform updates by taking the max over the next state's action values, something like so:

# standard Q-learning update: move Q(state, action) toward the TD target
best_next_state_val = max(q[next_state].values())
q[state][action] += alpha * (reward + gamma * best_next_state_val - q[state][action])

One thing to watch out for: if you train an agent with a q-table like this and always take the argmax action, it will pick the same action every time whenever all the action values are equal (such as right after the Q-function is initialized), so you should break ties randomly.
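A minimal sketch of breaking ties at random (using the standard random module; the variable names are assumptions):

import random

action_values = q[state]
best = max(action_values.values())
# pick uniformly among all actions whose Q value ties for the maximum
action = random.choice([a for a, v in action_values.items() if v == best])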

Finally, if you don't want to use dictionaries, you can map state and action tuples to indices, store the mapping in a dictionary, and use a lookup when you pass the state/action to your environment implementation. You can then use them as indices into a 2D numpy array.
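A rough sketch of that index-based layout (enumerating every state and action up front is an assumption of this example; since the legal actions differ per state, pairs that can never occur simply stay at zero):

import numpy as np

states = [((1,), (2,), (3,)), ((1,), (2, 3)), ((1,), (3, 2)), ((2,), (3, 1)), ((1, 2, 3),)]
actions = [(1, 2), (1, 3), (2, 0), (2, 1), (2, 3), (3, 0), (3, 1), (3, 2), (1, 0)]

state_index = {s: i for i, s in enumerate(states)}    # state tuple -> row index
action_index = {a: i for i, a in enumerate(actions)}  # action tuple -> column index

q = np.zeros((len(states), len(actions)))             # 2D Q-table: rows = states, columns = actions

s, a = ((1,), (2, 3)), (2, 1)
q[state_index[s], action_index[a]] += 0.1              # indexed update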
