
Most efficient way to remove several items in several lists in python?

I have several lists of items. There are no duplicates: each item appears at most once per list (and normally only once across all lists). I also have a list of items to remove from this dataset. How can this be done in the cleanest and most efficient way?

I have read that in Python, creating a new object is often simpler and faster than filtering an existing one. But I do not observe that in my basic tests:

data = [[i*j for j in range(1, 1000)] for i in range(1, 1000)]
kill = [1456, 1368, 2200, 36, 850, 9585, 59588, 60325, 9520, 9592, 210, 3]

# Method 1 : 0.1990 seconds
for j in kill:
    for i in data:
        if j in i:
            i.remove(j)

# Method 2 : 0.1920 seconds
for i in data:
    for j in kill:
        if j in i:
            i.remove(j)

# Method 3 : 0.2790 seconds
data = [[j for j in i if j not in kill] for i in data]

Which method is best to use in Python?

https://wiki.python.org/moin/TimeComplexity

remove is O(n) because it first searches linearly through the list and then, if it finds the item, shifts every element after it one position to the left in memory. Because of this, remove is quite an expensive operation.

Hence removing M items from a list of length N becomes O(N*M).

in on lists is also O(n), because we need to search through the whole list in order. Hence building a new list with a filter is also O(N*M). However, in on sets is O(1) due to hashing, which makes our filter O(N).
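To see that lookup-cost difference in practice, here is a small sketch (the sizes and values are illustrative):

```python
import timeit

kill_list = list(range(10_000))
kill_set = set(kill_list)

# Looking up the last element: the list scans all 10,000 items,
# while the set does a single hash lookup.
t_list = timeit.timeit(lambda: 9_999 in kill_list, number=1_000)
t_set = timeit.timeit(lambda: 9_999 in kill_set, number=1_000)
print(t_set < t_list)
```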

Hence the best solution is (I'm just going to use a flat list for simplicity, not nested):

def remove_kill_from_data(data, kill):
    s = set(kill)
    return [i for i in data if i not in s]

If you don't care about keeping the order, this is even faster (the set difference is done at the C level; it's still O(N)):

def remove_kill_from_data_unordered(data, kill):
    s = set(kill)
    d = set(data)
    return d - s
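A quick sanity check of the two helpers (restated here, with the set actually used in the filter, so the snippet runs standalone; the values are illustrative):

```python
def remove_kill_from_data(data, kill):
    # Build the set once, then filter against it.
    s = set(kill)
    return [i for i in data if i not in s]

def remove_kill_from_data_unordered(data, kill):
    # Set difference: same elements, but order is not preserved.
    return set(data) - set(kill)

data = [3, 1, 4, 5, 9, 2, 6]
kill = [1, 9, 100]  # items absent from data are simply ignored
print(remove_kill_from_data(data, kill))                       # [3, 4, 5, 2, 6]
print(remove_kill_from_data_unordered(data, kill) == {3, 4, 5, 2, 6})
```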

Applying it to your list of lists:

kill_set = set(kill)
[remove_kill_from_data(d, kill_set) for d in data]

Some timings (each copies from the static data first):

%timeit method1(data, kill)
210 ms ± 769 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit method2(data, kill)
208 ms ± 2.89 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit method3(data, kill)
272 ms ± 1.28 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit method4(data, kill)  # using remove_kill_from_data
69.6 ms ± 1.33 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit method5(data, kill) # using remove_kill_from_data_unordered
59.5 ms ± 3.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

There is no "best way to remove from a list in Python". If there were, Python would have only one way to do it. There are different best ways for different problems, which is why Python has different ways to do it.


Correctness is far more important than speed. Getting the wrong answer quickly is useless. (Otherwise, the fastest solution is to just do nothing at all.) And your first two implementations have two problems.

First, you use remove to find and remove the element by value. Besides being wasteful (you just searched the whole list to find the element, and now you're searching it again to find and remove it), that doesn't do the right thing if there are any duplicates: only the first one will get removed. And if there aren't any duplicates, you probably should be using a set (or an OrderedSet, if there aren't duplicates but order does matter), which would let you write this both more simply and immensely faster.
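A minimal sketch of the duplicate problem (values are illustrative):

```python
# remove() only deletes the first matching element.
lst = [1, 2, 3, 2, 4]
lst.remove(2)
print(lst)  # [1, 3, 2, 4] -- the second 2 survives
```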

Second, you're removing from a list while iterating it. This causes you to miss elements. If you delete element 2, all of the subsequent elements move up, so the original element 3 is now element 2, but your next time through the loop checks element 3. So, if you have two killable items in a row, the second one will be missed. You can solve this by iterating in reverse, but it makes things more complicated. Or you can iterate a copy while modifying the original, but that also makes things more complicated and costs time and space for the copy.
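A sketch of the skipped-element problem and the iterate-a-copy fix (values are illustrative):

```python
# Removing while iterating skips the element after each deletion:
lst1 = [1, 2, 2, 3]
for x in lst1:
    if x == 2:
        lst1.remove(x)
print(lst1)  # [1, 2, 3] -- the second 2 was skipped

# Iterating over a copy avoids the skip:
lst2 = [1, 2, 2, 3]
for x in lst2[:]:
    if x == 2:
        lst2.remove(x)
print(lst2)  # [1, 3]
```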

Both of these problems can be fixed, but this raises an important point: the first two versions are much easier to get subtly wrong, as proven by the fact that you got them wrong and didn't even notice.

And of course fixing these problems may well make the first two versions a bit slower instead of a bit faster.


Even if you fix these problems, mutating an object doesn't do the same thing as making a new object. If someone else has a reference to the same list, they will see the changes with the first two versions, but they'll keep the list they expected with the last version. If that someone else is code on another thread that might be iterating the list at the same time you're working on it, things get even more complicated. Sometimes you want the first behavior, sometimes the second. You can add more complexity to either version to get the opposite effect (e.g., assigning a comprehension to a slice of the whole list, instead of just rebinding the name), but usually it's simpler to write the one you want directly.
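A small sketch of the aliasing difference (values are illustrative):

```python
# Mutating filters through every name bound to the list:
a = [1, 2, 3]
b = a
a.remove(2)
print(b)  # [1, 3] -- b saw the mutation

# A comprehension rebinds one name and leaves the alias untouched:
c = [1, 2, 3]
d = c
c = [x for x in c if x != 2]
print(c, d)  # [1, 3] [1, 2, 3] -- d kept the original

# Slice assignment gets the mutating behavior back from a comprehension:
e = [1, 2, 3]
f = e
e[:] = [x for x in e if x != 2]
print(f)  # [1, 3]
```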


Plus, the comprehension version can be trivially changed to an iterative version that only does work on demand (just change one or both sets of brackets to parentheses). And it works on any iterable, not just lists. You can often get a huge performance benefit and/or simplification at a higher level by rewriting your algorithm as a chain of iterator transformations, so you never need the whole dataset in memory. But other times, you can get a huge performance or simplicity benefit from multiple passes or random-access patterns, so a list is much better. And that will determine which implementation you want for this piece of code.
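For instance, swapping brackets for parentheses makes the filter lazy (a sketch, using a set for kill as discussed above):

```python
kill_set = {2, 4}
data = [[1, 2, 3], [4, 5]]

# Generator expressions: no filtering happens until each one is consumed.
lazy = ((j for j in row if j not in kill_set) for row in data)
result = [list(g) for g in lazy]  # force evaluation
print(result)  # [[1, 3], [5]]
```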


There's also a space difference. The comprehension takes linear temporary space instead of constant, but on the other hand it can leave you with a smaller final result in memory because of the way Python grows and shrinks lists. (If this matters, you need to test it; the language doesn't even guarantee that lists shrink their storage at all, and how they do so is up to each implementation.)


Finally, we're talking about a pretty small difference. If this matters in your code, the fact that you're ignoring other options that could give a much larger improvement probably matters a bit more. If you can use a list of sets instead of a list of lists, the difference will be huge. If you can't, at least making kill a set speeds things up, and you can definitely do that. Using numpy might give an order-of-magnitude improvement. Just running the existing code in PyPy instead of CPython might speed it up almost as much as numpy, for a lot less work. Or you might want to write a C extension for your inner loop (which could just be a matter of putting the same code in a .pyx file and Cythonizing it). If none of those things seems worth the effort for an order-of-magnitude or better improvement, why is it worth the time you've already put into this for a 50% improvement?


Putting some actual numbers to this:

  • Method 1: 140ms
  • Corrected method 1: 193ms
  • Method 3: 190ms
  • Method 3 in PyPy: 21.6ms
  • [i - kill for i in data] where data is a list of sets and kill is a set: 20.6ms
  • data[~np.isin(data, kill)] where data is a np.array: 26.6ms
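A sketch of the two fastest variants from the list above (the arrays and kill values are illustrative):

```python
import numpy as np

kill = {3, 5}

# List of sets: per-row set difference.
data_sets = [{1, 2, 3}, {4, 5, 6}]
rows = [d - kill for d in data_sets]
print(rows)  # [{1, 2}, {4, 6}]

# numpy: boolean mask over a flat array (np.isin wants an array-like,
# so pass the kill set as a list).
data_arr = np.array([1, 2, 3, 4, 5, 6])
kept = data_arr[~np.isin(data_arr, list(kill))]
print(kept)  # [1 2 4 6]
```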

(I also tried the same tests in Python 2.7; method 3 is about 30% slower and method 4 about 15% slower, while the others are almost identical.)


As a side note, you didn't show us how you tested this code, and the tests are also easy to get subtly wrong. Even if you used timeit, you still need to make sure you're running against the original list each time, not repeating the code against the same already-filtered list (which would mean the first rep tests the right case, and the other 99999 reps test a different case where there is nothing left to kill).
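One way to keep each timed rep honest is to copy the data inside the timed callable, so every rep starts from the unfiltered list (a sketch; the helper mirrors the question's method 1, and the sizes are scaled down):

```python
import copy
import timeit

data = [[i * j for j in range(1, 100)] for i in range(1, 100)]
kill = [36, 210]

def method1(data, kill):
    for j in kill:
        for i in data:
            if j in i:
                i.remove(j)

# The deepcopy cost is included in the measurement, so use this only for
# relative comparisons (or time an empty deepcopy as a baseline to subtract).
t = timeit.timeit(lambda: method1(copy.deepcopy(data), kill), number=10)
print(t > 0 and 36 in data[0])  # the original data is untouched
```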

Disclaimer: the technical posts on this site follow the CC BY-SA 4.0 license. If you repost, please cite this site's URL or the original source. For any questions contact: yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM