Python check if list is present in list of lists for detecting cycles
I have a sample of about 60000 data points, and in an iterative algorithm, in each step, depending on some criterion, I either 'remove' one of those data points (set it to NaN) or 'add' one of the previously removed data points back into the sample (set it back to its original value). To avoid the algorithm falling into an infinite loop, the sample should be different in each iteration. I therefore keep track of the data points that are currently removed in each iteration and store their element indices in a list as follows:
Now in the current iteration 7, the algorithm suggests removing the element with array index 4, so the new state data_state_temp would be [2,1,3]. Currently it checks whether it has seen the state so far via

flag_cycle = (data_state_temp in data_state_list)

The algorithm keeps checking new states (adding/removing different array elements) until flag_cycle is False, then proceeds.
Apart from the fact that it does not yet fully work (the states [2,1,3] from iteration 7 and [2,3,1] from iteration 3 are the same, but the lists are not, so I need to sort them, or better, insert newly removed array elements where they would belong in a sorted list), the problem is that the algorithm becomes very slow. In practice, e.g., data_state_temp has length 15000, and data_state_list has 40000 lists of generally increasing length, up to 15000.
Questions: when checking whether data_state_temp is in data_state_list, does it compare list elements only for lists with lengths matching that of data_state_temp (I would expect that), or would we need to manually select these lists beforehand?

If you do not care about the order of past states, just that a state was ever visited before, then let's amass a set of all past states. This gives us a constant-time lookup instead of a linear lookup with in (e.g. for needle in haystack).
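To make the list-vs-set lookup difference concrete, here is a toy benchmark; the sizes are made up and far smaller than the 40000 x 15000 case from the question:

```python
import timeit

# toy history: 2000 past states of 100 indices each (sizes are illustrative)
history_list = [list(range(i, i + 100)) for i in range(2000)]
history_set = {tuple(h) for h in history_list}

needle = list(range(1999, 2099))  # equal to the last state in the history

# both containers give the same membership answer...
print(needle in history_list, tuple(needle) in history_set)

# ...but the list scans up to 2000 sub-lists per check,
# while the set does one hash lookup (plus one tuple conversion)
t_list = timeit.timeit(lambda: needle in history_list, number=500)
t_set = timeit.timeit(lambda: tuple(needle) in history_set, number=500)
print(f"list: {t_list:.5f}s  set: {t_set:.5f}s")
```

The exact timings depend on the machine, but the gap widens quickly as the history grows, because the list scan is linear in the number of stored states.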
For a set to work its magic, it has to work with hashable types. In short, a tuple is immutable and therefore hashable, and therefore a shoo-in candidate to be used with a set.
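To see the hashability requirement concretely (a minimal sketch):

```python
seen = set()

seen.add((2, 1, 3))          # tuples are immutable, hence hashable: fine
print((2, 1, 3) in seen)     # True

try:
    seen.add([2, 1, 3])      # lists are mutable, hence unhashable
except TypeError as e:
    print("rejected:", e)    # rejected: unhashable type: 'list'
```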
from itertools import chain

# let's say we already have a state that is 15000 long
data_states = set()
data_state = tuple(range(15000))
data_states.add(data_state)

# we loop until we decide on a new state
while data_state in data_states:
    # either you decide to remove an element, say 42
    # (can skip sort because the previous tuple was already sorted)
    new_data_state = tuple(x for x in data_state if x != 42)
    # or you decide to add an element, say 9000
    new_data_state = tuple(sorted(x for x in chain(data_state, [9000])))
    data_state = new_data_state

# now commit this state to your history
data_states.add(new_data_state)
Note that a tuple is immutable, so we have to sort first, before creating the tuple. Also note that the sorted(_) form creates a copy, whereas _.sort() performs an in-place sort. Since we are passing the previous state through a generator, the in-place sort is not possible, as the generator's items are not all held in memory, so sorted() is picked. Its result is then fed into the tuple() constructor.
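The difference in a nutshell:

```python
nums = [3, 1, 2]
print(sorted(nums))   # [1, 2, 3] -- new list, nums untouched
print(nums)           # [3, 1, 2]

print(nums.sort())    # None -- sorts in place, returns nothing
print(nums)           # [1, 2, 3]

gen = (x for x in [3, 1, 2])
print(sorted(gen))    # [1, 2, 3] -- sorted() happily consumes a generator
# gen.sort() would raise AttributeError: generators have no .sort() method
```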
Given len(state) = n:

new_data_state = tuple(sorted(x for x in chain(data_state, [9000])))

1. x for x in ...: a generator, n operations
2. sorted(...): creates a list, populates it with n entries
3. tuple(...): creates a tuple, populates it with n entries

Steps 1+2 happen in one pass over n. Step 3 is another pass over n. The total runtime complexity is O(2n) per iteration of the state.
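For the question's other idea, inserting a newly removed index where it belongs in an already-sorted state instead of re-sorting, the standard bisect module finds the position in O(log n); building the new tuple is still O(n). A sketch, with made-up helper names:

```python
import bisect

def with_index(state, idx):
    """New sorted tuple with idx inserted (a data point was just removed)."""
    pos = bisect.bisect_left(state, idx)
    return state[:pos] + (idx,) + state[pos:]

def without_index(state, idx):
    """New sorted tuple with idx taken back out (a data point was re-added)."""
    pos = bisect.bisect_left(state, idx)
    return state[:pos] + state[pos + 1:]

state = (1, 2, 3)
state = with_index(state, 4)      # (1, 2, 3, 4)
state = without_index(state, 2)   # (1, 3, 4)
print(state)
```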
What I listed above. If you become strapped for memory, you might consider more compact forms of representing the state.
For example, you might chunk the state into 3 chunks of 5000 consecutive numbers each, then use a nested dictionary keyed by those chunks (tuples, so they stay hashable), e.g.

data_states[tuple(range(5000))][tuple(range(5000, 10000))] == <all the remaining elements>

In this way the space cost is amortized: states that share their first two chunks reuse the same dictionary keys, so the incremental space cost for a state with 15777 elements is only 15777 - 5000 - 5000 = 5777 elements. This is essentially what a dictionary does.
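A sketch of that chunking idea (the helper names are made up; the example state is shorter than two full chunks, so its tail is empty):

```python
CHUNK = 5000  # chunk size from the example above

def add_state(states, state):
    """Store a sorted state tuple under two chunk keys plus a tail."""
    head, mid, tail = state[:CHUNK], state[CHUNK:2 * CHUNK], state[2 * CHUNK:]
    # nested dict: states that share head and mid reuse those key tuples
    states.setdefault(head, {}).setdefault(mid, set()).add(tail)

def has_state(states, state):
    head, mid, tail = state[:CHUNK], state[CHUNK:2 * CHUNK], state[2 * CHUNK:]
    return tail in states.get(head, {}).get(mid, set())

states = {}
add_state(states, tuple(range(7000)))
print(has_state(states, tuple(range(7000))))   # True
print(has_state(states, tuple(range(6999))))   # False: different mid chunk
```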
Yes, it performs length matching as the first check, to short-circuit to a False. But no, beyond that it still processes every single element.