Remove duplicates from one Python list, prune other lists based on it

I have a problem that's easy enough to do in an ugly way, but I'm wondering if there's a more Pythonic way of doing it.

Say I have three lists, A, B and C.

A = [1, 1, 2, 3, 4, 4, 5, 5, 3]
B = [1, 2, 3, 4, 5, 6, 7, 8, 9]
C = [1, 2, 3, 4, 5, 6, 7, 8, 9]
# The actual data isn't important.

I need to remove all duplicates from list A, but when a duplicate entry is deleted, I would like the entries at the corresponding indices removed from B and C:

A = [1, 2, 3, 4, 5]
B = [1, 3, 4, 5, 7]
C = [1, 3, 4, 5, 7]

This is easy enough to do with longer code by moving everything to new lists:

new_A = []
new_B = []
new_C = []
for i in range(len(A)):
    if A[i] not in new_A:
        new_A.append(A[i])
        new_B.append(B[i])
        new_C.append(C[i])

But is there a more elegant and efficient (and less repetitive) way of doing this? This could get cumbersome if the number of lists grows, which it might.

Zip the three lists together, uniquify based on the first element, then unzip:

from operator import itemgetter
from more_itertools import unique_everseen

abc = zip(a, b, c)
abc_unique = unique_everseen(abc, key=itemgetter(0))
a, b, c = zip(*abc_unique)  # in Python 3 these come back as tuples; wrap in list() if you need lists
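
With the sample data from the question, this produces exactly the output asked for:

a = [1, 1, 2, 3, 4, 4, 5, 5, 3]
b = [1, 2, 3, 4, 5, 6, 7, 8, 9]
c = [1, 2, 3, 4, 5, 6, 7, 8, 9]

abc_unique = unique_everseen(zip(a, b, c), key=itemgetter(0))
a, b, c = zip(*abc_unique)
print(a)  # (1, 2, 3, 4, 5)
print(b)  # (1, 3, 4, 5, 7)
print(c)  # (1, 3, 4, 5, 7)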

This is a very common pattern. Whenever you want to do anything in lock step over a bunch of lists (or other iterables), you zip them together and loop over the result.

Also, if you go from 3 lists to 42 of them ("This could get cumbersome if the number of lists grows, which it might."), this is trivial to extend:

abc = zip(*list_of_lists)
abc_unique = unique_everseen(abc, key=itemgetter(0))
list_of_lists = list(zip(*abc_unique))  # a list of tuples, one per original list

Once you get the hang of zip, the "uniquify" is the only hard part, so let me explain it.

Your existing code checks whether each element has been seen by searching for each one in new_A. Since new_A is a list, this means that if you have N elements, M of them unique, on average you're going to do M/2 comparisons for each of those N elements. Plug in some big numbers and NM/2 gets pretty big: e.g., with 1 million values, half of them unique, you're doing 250 billion comparisons.

To avoid that quadratic time, you use a set. A set can test an element for membership in constant, rather than linear, time. So, instead of 250 billion comparisons, that's 1 million hash lookups.
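
If you want to see the difference concretely, here's a quick timeit sketch (exact numbers depend on your machine):

import timeit

setup = "items = list(range(100_000)); s = set(items)"
# worst case for the list: the element we test for is at the very end
print(timeit.timeit("99_999 in items", setup=setup, number=1000))  # linear scan
print(timeit.timeit("99_999 in s", setup=setup, number=1000))      # hash lookup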

If you don't need to maintain order or decorate-process-undecorate the values, just copy the list to a set and you're done. If you need to decorate, you can use a dict instead of a set (with the key as the dict keys, and everything else hidden in the values). To preserve order, you could use an OrderedDict, but at that point it's easier to just use a list and a set side by side. For example, the smallest change to your code that works is:

new_A_set = set()  # mirrors new_A, but with O(1) membership tests
new_A = []
new_B = []
new_C = []
for i in range(len(A)):
    if A[i] not in new_A_set:  # hash lookup instead of a linear scan
        new_A_set.add(A[i])
        new_A.append(A[i])
        new_B.append(B[i])
        new_C.append(C[i])

But this can be generalized, and should be, especially if you're planning to expand from 3 lists to a whole lot of them.

The recipes in the itertools documentation include a function called unique_everseen that generalizes exactly what we want. You can copy and paste it into your code, write a simplified version yourself (sketched below), or pip install more-itertools and use someone else's implementation (as I did above).
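
For reference, a simplified version of that recipe looks roughly like this (the real recipe in the itertools docs and in more-itertools is more heavily optimized, but the logic is the same):

def unique_everseen(iterable, key=None):
    """Yield unique elements, preserving order; remember all elements ever seen."""
    seen = set()
    for element in iterable:
        k = element if key is None else key(element)
        if k not in seen:
            seen.add(k)
            yield element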


PadraicCunningham asks:

how efficient is zip(*unique_everseen(zip(a, b, c), key=itemgetter(0)))?

If there are N elements, M of them unique, it's O(N) time and O(M) space.

In fact, it's effectively doing the same work as the 10-line version above. In both cases, the only work inside the loop that isn't obviously trivial is key in seen and seen.add(key), and since both operations are amortized constant time for a set, the whole thing is O(N) time. In practice, for N=1000000, M=100000 the two versions are about 278ms and 297ms (I forget which is which), compared to minutes for the quadratic version. You could probably micro-optimize that down to 250ms or so, but it's hard to imagine a case where you'd need that and yet wouldn't benefit more from running it in PyPy instead of CPython, writing it in Cython or C, numpy-izing it, getting a faster computer, or parallelizing it.
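
If you want to check those numbers on your own machine, here's a rough benchmark sketch (the figures will vary; the sizes and data below are made up for illustration):

import random
from operator import itemgetter
from timeit import timeit
from more_itertools import unique_everseen

N, M = 1_000_000, 100_000
A = [random.randrange(M) for _ in range(N)]
B = list(range(N))
C = list(range(N))

def zip_version():
    return [list(t) for t in zip(*unique_everseen(zip(A, B, C), key=itemgetter(0)))]

def explicit_version():
    seen, new_A, new_B, new_C = set(), [], [], []
    for a, b, c in zip(A, B, C):
        if a not in seen:
            seen.add(a)
            new_A.append(a)
            new_B.append(b)
            new_C.append(c)
    return new_A, new_B, new_C

print(timeit(zip_version, number=10))
print(timeit(explicit_version, number=10))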

As for space, the explicit version makes it pretty obvious. Like any conceivable non-mutating algorithm, we've got the three new_Foo lists around at the same time as the original lists, and we've also added a new_A_set of the same size. Since all of those are length M, that's 4M space. We could cut that in half by doing one pass to get the indices, then doing the same thing mu 無's answer does:

indices = [index for index, value in unique_everseen(enumerate(a), key=itemgetter(1))]  # yields in input order
a = [a[index] for index in indices]
b = [b[index] for index in indices]
c = [c[index] for index in indices]

But there's no way to go lower than that; you have to have at least a set and a list of length M alive to uniquify a list of length N in linear time.

If you really need to save space, you can mutate all three lists in place (a sketch follows the footnote below). But this is a lot more complicated, and a bit slower (although still linear*).

Also, it's worth noting another advantage of the zip version: it works on any iterables. You can feed it three lazy iterators, and it won't have to instantiate them eagerly. I don't think that's doable in 2M space, but it's not too hard in 3M:

indices, a = zip(*unique_everseen(enumerate(a), key=itemgetter(1)))
indices = set(indices)  # used only for membership tests, so the set's arbitrary order doesn't matter
b = [value for index, value in enumerate(b) if index in indices]
c = [value for index, value in enumerate(c) if index in indices]

* Note that just del c[i] will make it quadratic, because deleting from the middle of a list takes linear time. Fortunately, that linear time is a giant memmove that's orders of magnitude faster than the equivalent number of Python assignments, so if N isn't too big you can get away with it; in fact, at N=100000, M=10000 it's twice as fast as the immutable version. But if N might be too big, you instead have to replace each duplicate element with a sentinel, then loop over the list in a second pass so that each element is shifted only once, which is about 50% slower than the immutable version.
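
For completeness, here's a minimal sketch of an in-place version that shifts each element only once, using a read/write two-index compaction rather than the sentinel pass described above (dedupe_inplace is a made-up name, not a library function):

def dedupe_inplace(a, *others):
    # 'write' trails 'read', so each kept element is moved at most once
    seen = set()
    write = 0
    for read in range(len(a)):
        if a[read] not in seen:
            seen.add(a[read])
            a[write] = a[read]
            for lst in others:
                lst[write] = lst[read]
            write += 1
    del a[write:]  # chop the leftover tail off in one step
    for lst in others:
        del lst[write:]

dedupe_inplace(a, b, c)  # mutates all three lists in place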

How about this: get a set of all the unique elements of A, then get their indices, and create new lists based on those indices.

new_A = list(set(A))  # note: a set does not preserve the original order
indices_to_copy = [A.index(element) for element in new_A]  # index of the first occurrence
new_B = [B[index] for index in indices_to_copy]
new_C = [C[index] for index in indices_to_copy]

You can write a function for the list-building step, for reuse:

def get_new_list(original_list, indices):
    return [original_list[idx] for idx in indices]
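
For example:

new_B = get_new_list(B, indices_to_copy)
new_C = get_new_list(C, indices_to_copy)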
