
Remove duplicates from one Python list, prune other lists based on it

I have a problem that's easy enough to do in an ugly way, but I'm wondering if there's a more Pythonic way of doing it.

Say I have three lists, A, B, and C.

A = [1, 1, 2, 3, 4, 4, 5, 5, 3]
B = [1, 2, 3, 4, 5, 6, 7, 8, 9]
C = [1, 2, 3, 4, 5, 6, 7, 8, 9]
# The actual data isn't important.

I need to remove all duplicates from list A, but when a duplicate entry is deleted, I would like the corresponding indexes removed from B and C:

A = [1, 2, 3, 4, 5]
B = [1, 3, 4, 5, 7]
C = [1, 3, 4, 5, 7]

This is easy enough to do with longer code by moving everything to new lists:

new_A = []
new_B = []
new_C = []
for i in range(len(A)):
  if A[i] not in new_A:
    new_A.append(A[i])
    new_B.append(B[i])
    new_C.append(C[i])

But is there a more elegant and efficient (and less repetitive) way of doing this? This could get cumbersome if the number of lists grows, which it might.

Zip the three lists together, uniquify based on the first element, then unzip:

from operator import itemgetter
from more_itertools import unique_everseen

abc = zip(a, b, c)
abc_unique = unique_everseen(abc, key=itemgetter(0))
a, b, c = zip(*abc_unique)

This is a very common pattern. Whenever you want to do anything in lock step over a bunch of lists (or other iterables), you zip them together and loop over the result.
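
To make that concrete, here's what it looks like with the question's sample data (just a quick sketch assuming more-itertools is installed; note that in Python 3, zip returns an iterator and unpacking gives tuples, so wrap the results in list() if you need actual lists):

from operator import itemgetter
from more_itertools import unique_everseen

A = [1, 1, 2, 3, 4, 4, 5, 5, 3]
B = [1, 2, 3, 4, 5, 6, 7, 8, 9]
C = [1, 2, 3, 4, 5, 6, 7, 8, 9]

abc_unique = unique_everseen(zip(A, B, C), key=itemgetter(0))
A, B, C = (list(t) for t in zip(*abc_unique))
print(A)  # [1, 2, 3, 4, 5]
print(B)  # [1, 3, 4, 5, 7]
print(C)  # [1, 3, 4, 5, 7]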

Also, if you go from 3 lists to 42 of them ("This could get cumbersome if the number of lists grows, which it might."), this is trivial to extend:

abc = zip(*list_of_lists)
abc_unique = unique_everseen(abc, key=itemgetter(0))
list_of_lists = [list(t) for t in zip(*abc_unique)]

Once you get the hang of zip, the "uniquify" step is the only hard part, so let me explain it.

Your existing code checks whether each element has been seen by searching for each one in new_A. Since new_A is a list, this means that if you have N elements, M of them unique, on average you're doing M/2 comparisons for each of those N elements. Plug in some big numbers and NM/2 gets pretty big: e.g., 1 million values, half of them unique, and you're doing 250 billion comparisons.

To avoid that quadratic time, you use a set. A set can test an element for membership in constant, rather than linear, time. So, instead of 250 billion comparisons, that's 1 million hash lookups.

If you don't need to maintain order or decorate-process-undecorate the values, just copy the list to a set and you're done. If you need to decorate, you can use a dict instead of a set (with the key as the dict keys, and everything else hidden in the values). To preserve order, you could use an OrderedDict, but at that point it's easier to just use a list and a set side by side. For example, the smallest change to your code that works is:

new_A_set = set()
new_A = []
new_B = []
new_C = []
for i in range(len(A)):
    if A[i] not in new_A_set:
        new_A_set.add(A[i])
        new_A.append(A[i])
        new_B.append(B[i])
        new_C.append(C[i])
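
As an aside, if you did want the dict/OrderedDict variant mentioned above, a minimal sketch of that decoration idea (not part of the original answer, just an illustration) could look like this:

from collections import OrderedDict

decorated = OrderedDict()
for a, b, c in zip(A, B, C):
    # Keep only the first (a, b, c) triple seen for each value of a.
    decorated.setdefault(a, (b, c))
new_A = list(decorated)
new_B = [b for b, c in decorated.values()]
new_C = [c for b, c in decorated.values()]

Here each value of A is the key and the matching B and C entries ride along as the value, so the first occurrence wins. (In Python 3.7+ a plain dict preserves insertion order too.)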

But this can be generalized—and should be, especially if you're planning to expand from 3 lists to a whole lot of them.

The recipes in the itertools documentation include a function called unique_everseen that generalizes exactly what we want. You can copy and paste it into your code, write a simplified version yourself, or pip install more-itertools and use someone else's implementation (as I did above).
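
If you'd rather not copy the full recipe or add a dependency, a stripped-down version along the same lines might look like this (the recipe in the docs is a bit more optimized, but the idea is the same):

def unique_everseen(iterable, key=None):
    # Yield elements in order, skipping any whose key has been seen before.
    seen = set()
    for element in iterable:
        k = element if key is None else key(element)
        if k not in seen:
            seen.add(k)
            yield element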


PadraicCunningham asks:

how efficient is zip(*unique_everseen(zip(a, b, c), key=itemgetter(0)))?

If there are N elements, M unique, it's O(N) time and O(M) space.

In fact, it's effectively doing the same work as the 10-line version above. In both cases, the only work inside the loop that isn't obviously trivial is key in seen and seen.add(key), and since both operations are amortized constant time for a set, the whole thing is O(N) time. In practice, for N=1000000, M=100000 the two versions take about 278ms and 297ms (I forget which is which), compared to minutes for the quadratic version. You could probably micro-optimize that down to 250ms or so, but it's hard to imagine a case where you'd need that and wouldn't benefit more from running it in PyPy instead of CPython, writing it in Cython or C, numpy-izing it, getting a faster computer, or parallelizing it.
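
If you want to reproduce numbers in that ballpark on your own machine, a rough timing harness might look something like this (the data here is just random integers sized to match the N and M above):

import random
import timeit
from operator import itemgetter
from more_itertools import unique_everseen

N, M = 1000000, 100000
a = [random.randrange(M) for _ in range(N)]
b = list(range(N))
c = list(range(N))

def explicit_loop():
    seen, new_a, new_b, new_c = set(), [], [], []
    for x, y, z in zip(a, b, c):
        if x not in seen:
            seen.add(x)
            new_a.append(x)
            new_b.append(y)
            new_c.append(z)
    return new_a, new_b, new_c

def zipped():
    return list(zip(*unique_everseen(zip(a, b, c), key=itemgetter(0))))

print(timeit.timeit(explicit_loop, number=5))  # total time for 5 runs
print(timeit.timeit(zipped, number=5))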

As for space, the explicit version makes it pretty obvious. Like any conceivable non-mutating algorithm, we've got the three new_Foo lists around at the same time as the original lists, and we've also added new_A_set of the same size. Since all of those are length M, that's 4M space. We could cut that in half by doing one pass to get indices, then doing the same thing mu 無's answer does:

indices = [index for index, value in unique_everseen(enumerate(a), key=itemgetter(1))]
a = [a[index] for index in indices]
b = [b[index] for index in indices]
c = [c[index] for index in indices]

But there's no way to go lower than that; you have to have at least a set and a list of length M alive to uniquify a list of length N in linear time.

If you really need to save space, you can mutate all three lists in-place. But this is a lot more complicated, and a bit slower (although still linear*).

Also, it's worth noting another advantage of the zip version: it works on any iterables. You can feed it three lazy iterators, and it won't have to instantiate them eagerly. I don't think it's doable in 2M space, but it's not too hard in 3M:

indices, a = zip(*unique_everseen(enumerate(a), key=itemgetter(1)))
indices = set(indices)
b = [value for index, value in enumerate(b) if index in indices]
c = [value for index, value in enumerate(c) if index in indices]

* Note that just del c[i] will make it quadratic, because deleting from the middle of a list takes linear time. Fortunately, that linear time is a giant memmove that's orders of magnitude faster than the equivalent number of Python assignments, so if N isn't too big you can get away with it—in fact, at N=100000, M=10000 it's twice as fast as the immutable version… But if N might be too big, you have to instead replace each duplicate element with a sentinel, then loop over the list in a second pass so you can shift each element only once, which is instead 50% slower than the immutable version.
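
For reference, that two-pass, sentinel-based, in-place version might be sketched like this (not from the original answer, just an illustration of the idea in the footnote):

def dedupe_in_place(a, b, c):
    # First pass: overwrite duplicates in a with a sentinel instead of
    # deleting them, so there's no linear-time del per duplicate.
    sentinel = object()
    seen = set()
    for i, value in enumerate(a):
        if value in seen:
            a[i] = sentinel
        else:
            seen.add(value)
    # Second pass: compact all three lists in lock step, shifting each
    # surviving element exactly once, then truncate the tails.
    write = 0
    for read in range(len(a)):
        if a[read] is not sentinel:
            a[write], b[write], c[write] = a[read], b[read], c[read]
            write += 1
    del a[write:], b[write:], c[write:]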

How about this: get a set of all the unique elements of A, then get their indices, and build the new lists from those indices.

# Note: set() doesn't preserve order, so new_A (and the lists built from
# indices_to_copy) stay aligned with each other but may not keep A's original order.
new_A = list(set(A))
indices_to_copy = [A.index(element) for element in new_A]
new_B = [B[index] for index in indices_to_copy]
new_C = [C[index] for index in indices_to_copy]

You can write a function for the second statement, for reuse:

def get_new_list(original_list, indices):
    return [original_list[idx] for idx in indices]
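
For example, the pruned lists above could then be built as:

new_B = get_new_list(B, indices_to_copy)
new_C = get_new_list(C, indices_to_copy)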
