
Remove duplicates from one Python list, prune other lists based on it

I have a problem that's easy enough to do in an ugly way, but I'm wondering if there's a more Pythonic way of doing it.

Say I have three lists, A, B, and C.

A = [1, 1, 2, 3, 4, 4, 5, 5, 3]
B = [1, 2, 3, 4, 5, 6, 7, 8, 9]
C = [1, 2, 3, 4, 5, 6, 7, 8, 9]
# The actual data isn't important.

I need to remove all duplicates from list A, but when a duplicate entry is deleted, I would like the corresponding indexes removed from B and C:

A = [1, 2, 3, 4, 5]
B = [1, 3, 4, 5, 7]
C = [1, 3, 4, 5, 7]

This is easy enough to do with longer code by moving everything to new lists:

new_A = []
new_B = []
new_C = []
for i in range(len(A)):
  if A[i] not in new_A:
    new_A.append(A[i])
    new_B.append(B[i])
    new_C.append(C[i])

But is there a more elegant and efficient (and less repetitive) way of doing this? This could get cumbersome if the number of lists grows, which it might.

Zip the three lists together, uniquify based on the first element, then unzip:

from operator import itemgetter
from more_itertools import unique_everseen

abc = zip(a, b, c)
abc_unique = unique_everseen(abc, key=itemgetter(0))
a, b, c = zip(*abc_unique)

This is a very common pattern. Whenever you want to do anything in lock step over a bunch of lists (or other iterables), you zip them together and loop over the result.
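
To make that concrete, here's what it looks like with the question's sample data (just a quick sketch assuming more-itertools is installed; note that in Python 3, zip returns an iterator and unpacking gives tuples, so wrap the results in list() if you need actual lists):

from operator import itemgetter
from more_itertools import unique_everseen

A = [1, 1, 2, 3, 4, 4, 5, 5, 3]
B = [1, 2, 3, 4, 5, 6, 7, 8, 9]
C = [1, 2, 3, 4, 5, 6, 7, 8, 9]

abc_unique = unique_everseen(zip(A, B, C), key=itemgetter(0))
A, B, C = (list(t) for t in zip(*abc_unique))
print(A)  # [1, 2, 3, 4, 5]
print(B)  # [1, 3, 4, 5, 7]
print(C)  # [1, 3, 4, 5, 7]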

Also, if you go from 3 lists to 42 of them ("This could get cumbersome if the number of lists grows, which it might."), this is trivial to extend:

abc = zip(*list_of_lists)
abc_unique = unique_everseen(abc, key=itemgetter(0))
list_of_lists = [list(t) for t in zip(*abc_unique)]

Once you get the hang of zip, the "uniquify" step is the only hard part, so let me explain it.

Your existing code checks whether each element has been seen by searching for each one in new_A. Since new_A is a list, this means that if you have N elements, M of them unique, on average you're doing M/2 comparisons for each of those N elements. Plug in some big numbers and NM/2 gets pretty big: e.g., 1 million values, half of them unique, and you're doing 250 billion comparisons.

To avoid that quadratic time, you use a set. A set can test an element for membership in constant, rather than linear, time. So, instead of 250 billion comparisons, that's 1 million hash lookups.

If you don't need to maintain order or decorate-process-undecorate the values, just copy the list to a set and you're done. If you need to decorate, you can use a dict instead of a set (with the key as the dict keys, and everything else hidden in the values). To preserve order, you could use an OrderedDict, but at that point it's easier to just use a list and a set side by side. For example, the smallest change to your code that works is:

new_A_set = set()
new_A = []
new_B = []
new_C = []
for i in range(len(A)):
    if A[i] not in new_A_set:
        new_A_set.add(A[i])
        new_A.append(A[i])
        new_B.append(B[i])
        new_C.append(C[i])
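
As an aside, if you did want the dict/OrderedDict variant mentioned above, a minimal sketch of that decoration idea (not part of the original answer, just an illustration) could look like this:

from collections import OrderedDict

decorated = OrderedDict()
for a, b, c in zip(A, B, C):
    # Keep only the first (a, b, c) triple seen for each value of a.
    decorated.setdefault(a, (b, c))
new_A = list(decorated)
new_B = [b for b, c in decorated.values()]
new_C = [c for b, c in decorated.values()]

Here each value of A is the key and the matching B and C entries ride along as the value, so the first occurrence wins. (In Python 3.7+ a plain dict preserves insertion order too.)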

But this can be generalized—and should be, especially if you're planning to expand from 3 lists to a whole lot of them.

The recipes in the itertools documentation include a function called unique_everseen that generalizes exactly what we want. You can copy and paste it into your code, write a simplified version yourself, or pip install more-itertools and use someone else's implementation (as I did above).
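
If you'd rather not copy the full recipe or add a dependency, a stripped-down version along the same lines might look like this (the recipe in the docs is a bit more optimized, but the idea is the same):

def unique_everseen(iterable, key=None):
    # Yield elements in order, skipping any whose key has been seen before.
    seen = set()
    for element in iterable:
        k = element if key is None else key(element)
        if k not in seen:
            seen.add(k)
            yield element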


PadraicCunningham asks:

how efficient is zip(*unique_everseen(zip(a, b, c), key=itemgetter(0)))?

If there are N elements, M unique, it's O(N) time and O(M) space.

In fact, it's effectively doing the same work as the 10-line version above. In both cases, the only work inside the loop that isn't obviously trivial is key in seen and seen.add(key), and since both operations are amortized constant time for a set, the whole thing is O(N) time. In practice, for N=1000000, M=100000 the two versions take about 278ms and 297ms (I forget which is which), compared to minutes for the quadratic version. You could probably micro-optimize that down to 250ms or so, but it's hard to imagine a case where you'd need that and wouldn't benefit more from running it in PyPy instead of CPython, writing it in Cython or C, numpy-izing it, getting a faster computer, or parallelizing it.
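
If you want to reproduce numbers in that ballpark on your own machine, a rough timing harness might look something like this (the data here is just random integers sized to match the N and M above):

import random
import timeit
from operator import itemgetter
from more_itertools import unique_everseen

N, M = 1000000, 100000
a = [random.randrange(M) for _ in range(N)]
b = list(range(N))
c = list(range(N))

def explicit_loop():
    seen, new_a, new_b, new_c = set(), [], [], []
    for x, y, z in zip(a, b, c):
        if x not in seen:
            seen.add(x)
            new_a.append(x)
            new_b.append(y)
            new_c.append(z)
    return new_a, new_b, new_c

def zipped():
    return list(zip(*unique_everseen(zip(a, b, c), key=itemgetter(0))))

print(timeit.timeit(explicit_loop, number=5))  # total time for 5 runs
print(timeit.timeit(zipped, number=5))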

As for space, the explicit version makes it pretty obvious. Like any conceivable non-mutating algorithm, we've got the three new_Foo lists around at the same time as the original lists, and we've also added new_A_set of the same size. Since all of those are length M, that's 4M space. We could cut that in half by doing one pass to get indices, then doing the same thing mu 無's answer does:

indices = [index for index, value in unique_everseen(enumerate(a), key=itemgetter(1))]
a = [a[index] for index in indices]
b = [b[index] for index in indices]
c = [c[index] for index in indices]

But there's no way to go lower than that; you have to have at least a set and a list of length M alive to uniquify a list of length N in linear time.

If you really need to save space, you can mutate all three lists in-place. But this is a lot more complicated, and a bit slower (although still linear*).

Also, it's worth noting another advantage of the zip version: it works on any iterables. You can feed it three lazy iterators, and it won't have to instantiate them eagerly. I don't think it's doable in 2M space, but it's not too hard in 3M:

indices, a = zip(*unique_everseen(enumerate(a), key=itemgetter(1)))
indices = set(indices)
b = [value for index, value in enumerate(b) if index in indices]
c = [value for index, value in enumerate(c) if index in indices]

* Note that just del c[i] will make it quadratic, because deleting from the middle of a list takes linear time. Fortunately, that linear time is a giant memmove that's orders of magnitude faster than the equivalent number of Python assignments, so if N isn't too big you can get away with it—in fact, at N=100000, M=10000 it's twice as fast as the immutable version… But if N might be too big, you have to instead replace each duplicate element with a sentinel, then loop over the list in a second pass so you can shift each element only once, which is instead 50% slower than the immutable version.
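
For reference, that two-pass, sentinel-based, in-place version might be sketched like this (not from the original answer, just an illustration of the idea in the footnote):

def dedupe_in_place(a, b, c):
    # First pass: overwrite duplicates in a with a sentinel instead of
    # deleting them, so there's no linear-time del per duplicate.
    sentinel = object()
    seen = set()
    for i, value in enumerate(a):
        if value in seen:
            a[i] = sentinel
        else:
            seen.add(value)
    # Second pass: compact all three lists in lock step, shifting each
    # surviving element exactly once, then truncate the tails.
    write = 0
    for read in range(len(a)):
        if a[read] is not sentinel:
            a[write], b[write], c[write] = a[read], b[read], c[read]
            write += 1
    del a[write:], b[write:], c[write:]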

How about this: get a set of all the unique elements of A, then get their indices, and build the new lists from those indices.

# Note: set() doesn't preserve order, so new_A (and the lists built from
# indices_to_copy) stay aligned with each other but may not keep A's original order.
new_A = list(set(A))
indices_to_copy = [A.index(element) for element in new_A]
new_B = [B[index] for index in indices_to_copy]
new_C = [C[index] for index in indices_to_copy]

You can write a function for the second statement, for reuse:

def get_new_list(original_list, indices):
    return [original_list[idx] for idx in indices]
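
For example, the pruned lists above could then be built as:

new_B = get_new_list(B, indices_to_copy)
new_C = get_new_list(C, indices_to_copy)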
