
Python check if list is present in list of lists for detecting cycles

I have a sample of about 60000 data points. In an iterative algorithm, in each step, depending on some criterion, I either 'remove' one of those data points (set it to NaN) or 'add' one of the previously removed data points back into the sample (set it back to its original value). To prevent the algorithm from falling into an infinite loop, the sample must be different in every iteration. I therefore keep track of the data points that are currently removed in each iteration and store their element indices in a list as follows:

  • Iteration 1: data_state_list = [[2]] (element with array index 2 is removed)
  • Iteration 2: data_state_list = [[2],[2,3]] (element with array index 3 is removed)
  • Iteration 3: data_state_list = [[2],[2,3],[2,3,1]]
  • Iteration 4: data_state_list = [[2],[2,3],[2,3,1],[2,1]] (element with array index 3 is re-added)
  • Iteration 5: data_state_list = [[2],[2,3],[2,3,1],[2,1],[2,1,4]]
  • Iteration 6: data_state_list = [[2],[2,3],[2,3,1],[2,1],[2,1,4],[2,1,4,3]]

Now, in the current iteration 7, the algorithm suggests removing the element with array index 4, so the new state data_state_temp would be [2,1,3]. Currently it checks whether this state has been seen before via

flag_cycle = (data_state_temp in data_state_list)

The algorithm keeps proposing new states (adding/removing different array elements) until flag_cycle is False, then proceeds.

Apart from the fact that this does not yet fully work (the state [2,1,3] from iteration 7 and [2,3,1] from iteration 3 represent the same set of indices, but the lists compare unequal, so I would need to sort them, or better, insert newly removed indices at their position in a sorted list), the main problem is that the algorithm becomes very slow. In practice, data_state_temp has length up to 15000, and data_state_list contains about 40000 lists of generally increasing length, up to 15000.
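For instance, canonicalizing each state by sorting it before the membership check makes the comparison order-insensitive. A small sketch, reusing the names data_state_temp and data_state_list from above:

```python
# States from iteration 3 and iteration 7: same indices, different order.
state_iter3 = [2, 3, 1]
state_iter7 = [2, 1, 3]

# Plain list comparison misses the cycle ...
print(state_iter7 == state_iter3)                   # False

# ... but comparing sorted (canonical) forms detects it.
print(sorted(state_iter7) == sorted(state_iter3))   # True

# Applied to the history check: canonicalize both sides first.
data_state_list = [sorted(s) for s in [[2], [2, 3], [2, 3, 1], [2, 1]]]
data_state_temp = sorted([2, 1, 3])
flag_cycle = data_state_temp in data_state_list
print(flag_cycle)                                   # True
```

This fixes the correctness issue, but each `in` check is still a linear scan, which the set-based approach below avoids.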

Questions:

  • How can we speed up this cycle/infinite-loop check? Conceptually different approaches to checking whether the same state has occurred before are perfectly fine.
  • In the current code, when Python checks whether data_state_temp is in data_state_list, does it compare list elements only for lists whose length matches that of data_state_temp (I would expect so), or would we need to manually select those lists beforehand?

Keep History as a Set (Constant-Time Lookup)

If you do not care about the order of past states, only whether a state was ever visited before, then let's accumulate a set of all past states. This gives us an average O(1) hash lookup instead of the linear scan that in performs on a list (i.e. needle in haystack).

For a set to work its magic, it has to hold hashable values. In short, a tuple is immutable and therefore hashable, and therefore a shoo-in candidate to be used with set.
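A quick illustration of the difference (the err variable only exists to capture the exception):

```python
visited = set()

visited.add((2, 3, 1))          # tuples are hashable, so this works
print((2, 3, 1) in visited)     # True

err = None
try:
    visited.add([2, 3, 1])      # lists are mutable and therefore unhashable
except TypeError as e:
    err = e
print(err)                      # unhashable type: 'list'
```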

from itertools import chain

# Suppose we already have a state that is 15000 elements long.
data_states = set()
data_state = tuple(range(15000))
data_states.add(data_state)

# Loop until we arrive at a state we have not seen before.
while data_state in data_states:
    # Either remove an element, say 42
    # (no sort needed: removing from an already-sorted tuple keeps it sorted)
    new_data_state = tuple(x for x in data_state if x != 42)

    # ... or add an element, say 9000, sorting to keep the canonical form:
    # new_data_state = tuple(sorted(chain(data_state, [9000])))

    data_state = new_data_state

# Now commit this state to the history.
data_states.add(data_state)

Note that a tuple is immutable, so the sorting has to happen before the tuple is created. Also note that sorted(_) creates a new sorted list, whereas _.sort() sorts a list in place. Since the previous state is consumed as a generator here, there is no list to sort in place, so the sorted() form is the one to use; its result is then fed into the tuple() constructor.

Runtime Complexity

Given len(state) = n:

new_data_state = tuple(sorted(x for x in chain(data_state, [9000])))
  1. x for x in ...: generator yielding n elements
  2. sorted(...): creates a list and populates it with n entries
  3. tuple(...): creates a tuple and populates it with n entries

Steps 1+2 happen in one pass over the n elements; Step 3 makes another pass. The total cost is about 2n operations, i.e. O(n), per iteration of state (plus the O(n log n) comparison cost inside sorted when sorting is actually needed).
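The payoff is in the lookup itself. A rough micro-benchmark sketch (timings vary by machine, and the sizes here are scaled down from the 40000 x 15000 figures above) comparing a linear in scan over a list of lists against a hash lookup in a set of tuples:

```python
import timeit

n_states, state_len = 1000, 100
history_list = [list(range(i, i + state_len)) for i in range(n_states)]
history_set = {tuple(s) for s in history_list}

# Probe equal to the last history entry: worst case for the linear scan.
probe_list = list(range(n_states - 1, n_states - 1 + state_len))
probe_tuple = tuple(probe_list)

t_list = timeit.timeit(lambda: probe_list in history_list, number=1000)
t_set = timeit.timeit(lambda: probe_tuple in history_set, number=1000)
print(f"list scan: {t_list:.4f}s, set lookup: {t_set:.4f}s")
```

The list scan grows with the number of stored states, while the set lookup cost depends only on hashing the probe tuple.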

Q: How to speed up?

What I listed above. If you become strapped for memory, you might consider more compact forms of representing state.

For example, you might split state into 3 chunks of up to 5000 consecutive indices each, and use a nested dictionary keyed by those chunks, e.g. data_states[tuple(range(5000))][tuple(range(5000, 10000))] == <all the remaining indices>.

In this way, the space cost is amortized: states that share the same leading chunks reuse the same keys, so the incremental space cost for a state with 15777 elements is only 15777 - 5000 - 5000 = 5777 elements. This is essentially prefix sharing via nested dictionaries.
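A minimal sketch of that chunking idea, using one dictionary level for the first chunk and an inner set for the remainder. The names add_state and seen, and the CHUNK constant, are hypothetical, and this is an illustration of the layout rather than a tuned implementation:

```python
CHUNK = 5000  # how many leading elements form the outer key

def add_state(history, state):
    """Store a (sorted) state tuple, sharing its leading chunk as a key."""
    prefix, rest = state[:CHUNK], state[CHUNK:]
    history.setdefault(prefix, set()).add(rest)

def seen(history, state):
    """Check whether this exact state was stored before."""
    prefix, rest = state[:CHUNK], state[CHUNK:]
    return rest in history.get(prefix, set())

history = {}
s = tuple(range(15777))
add_state(history, s)
print(seen(history, s))                    # True
print(seen(history, tuple(range(100))))    # False
```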

Q: Does it compare list elements by length?

Yes, list equality compares lengths first, short-circuiting to False on a mismatch, so lists of a different length are rejected cheaply. But no, the in operator still walks through every single list in data_state_list, so the overall check remains a linear scan; pre-selecting lists by length yourself would not change that.
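This short-circuit can be observed with a small probe class that records element comparisons (relying on CPython's list comparison behavior, which checks lengths before comparing any elements):

```python
class Probe:
    """Records how many times its __eq__ is invoked."""
    calls = 0

    def __eq__(self, other):
        Probe.calls += 1
        return True

    def __hash__(self):
        return 0

# Different lengths: equality is decided without comparing any elements.
a, b = [Probe(), Probe()], [Probe()]
result_diff_len = (a == b)
calls_after_diff_len = Probe.calls
print(result_diff_len, calls_after_diff_len)   # False 0

# Equal lengths: elements are compared pairwise.
c, d = [Probe()], [Probe()]
result_same_len = (c == d)
print(result_same_len, Probe.calls)            # True 1
```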
