Efficiently remove partial duplicates in a list of tuples

I have a list of tuples; the list can vary in length between roughly 8 and 1000, depending on the length of the tuples. Each tuple in the list is unique. A tuple is of length N, where each entry is a generic word.

An example tuple of length N: (Word 1, Word 2, Word 3, ..., Word N)

For any tuple in the list, element j of that tuple will be either '' or Word j.

A very simplified example with alphabetic letters would be

l = [('A', 'B', '', ''), ('A', 'B', 'C', ''), 
     ('', '', '', 'D'), ('A', '', '', 'D'), 
     ('', 'B', '', '')]

Every position in each tuple will either hold the same value or be empty. I want to remove all tuples whose non-'' values all appear, at the same positions, in another tuple. As an example, ('A','B','','') has all its non-'' values in ('A','B','C','') and should therefore be removed.

filtered_l = [('A','B','C',''), ('A','','','D')]

The tuples are always of the same length (not necessarily 4), which would be between 2 and 10.

What is the fastest way to do this?

Let's conceptualize each tuple as a binary array, where 1 means "contains something" and 0 means "contains an empty string". Since the item at each position will be the same, we don't need to care what is at each position, only that something is.

l = [('A','B','',''),('A','B','C',''),('','','','D'),('A','','','D'),('','B','','')]
l_bin = [sum(2**i if k else 0 for i,k in enumerate(tup)) for tup in l]
# [3, 7, 8, 9, 2]
# [0b0011, 0b0111, 0b1000, 0b1001, 0b0010]
# that it's backwards doesn't really matter, since it's consistent

Now, we can walk through that list and build a new data structure without 'duplicates'. Since we have our tuples encoded as binary, we can determine whether one is 'encompassed' by another with bitwise operations: given a and b, if a | b == a, then a must contain b.
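
As a quick sanity check, here is that containment test in isolation, using the masks computed above:

a, b = 0b0111, 0b0011    # ('A', 'B', 'C', '') and ('A', 'B', '', '')
print(a | b == a)        # True: every bit set in b is also set in a
print(0b1000 | 0b0011 == 0b1000)  # False: 0b1000 does not contain 0b0011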

codes = {}
for tup, b in zip(l, l_bin):
    # check if any existing code contains the potential new one
    # in this case, skip adding the new one
    if any(a | b == a for a in codes):
        continue
    # check if the new code contains a potential existing one or more
    # in which case, replace the existing code(s) with the new code
    for a in list(codes):
        if b | a == b:
            codes.pop(a)
    # and finally, add this code to our datastructure
    codes[b] = tup

Now we can extract our 'filtered' list of tuples:

output = list(codes.values())
# [('A', 'B', 'C', ''), ('A', '', '', 'D')]

Note that ('A', 'B', 'C', '') contains both ('A', 'B', '', '') and ('', 'B', '', ''), and that ('A', '', '', 'D') contains ('', '', '', 'D'), so this should be correct.

As of Python 3.7, dict preserves insertion order, so the output will be in the same order in which the tuples originally appeared in the list.

This solution isn't perfectly efficient, since the number of stored codes might stack up, but it should be between O(n) and O(n^2), depending on the number of unique codes left at the end (and since the length of each tuple is significantly less than the length of l, it should be closer to O(n) than to O(n^2)).

For those limits in particular, the obvious solution would be to convert each tuple to a bit mask, accumulate the masks in a counter array, perform a subset-sum transformation, and then filter the list l.

See the detailed explanation in the code comments below.

Time complexity is O(n * m + m * 2^m), where n is the number of tuples and m is the length of each tuple. For n == 1000 and m == 10, this is considerably faster than O(n^2).
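
To illustrate the subset-sum (sum-over-subsets) transformation used below, here is what it does to a small count array for m = 2 (the initial counts are made up purely for the demonstration):

count = [1, 2, 3, 4]  # counts for masks 0b00, 0b01, 0b10, 0b11
for dimension in range(2):
    for mask in range(4):
        if mask >> dimension & 1:
            count[mask] += count[mask - (1 << dimension)]
print(count)  # [1, 3, 4, 10]; e.g. count[0b11] == 4 + 3 + 2 + 1, a sum over all its subsets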

l = [('A','B','',''),('A','B','C',''),('','','','D'),('A','','','D'),('','B','','')]
# assumes that l is not empty. (to access l[0])
# The case where l is empty is trivial to handle.

def tuple_to_mask(tuple_):
    # convert the information whether each value in (tuple_) is empty to a bit mask
    # (1 is empty, 0 is not empty)
    return sum((value == '') << index for index, value in enumerate(tuple_))
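# example: tuple_to_mask(('A', 'B', '', '')) == 0b1100 == 12  (bits 2 and 3 are set)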


count = [0] * (1 << len(l[0]))
for tuple_ in l:
    # tuple_ is a tuple.
    count[tuple_to_mask(tuple_)] += 1

# now count[mask] is the number of tuples in l with that mask

# transform the count array.
for dimension in range(len(l[0])):
    for mask in range(len(count)):
        if mask >> dimension & 1:
            count[mask] += count[mask - (1 << dimension)]

# now count[mask] is the number of tuples in l with a mask (mask_) such that (mask) contains (mask_)
# (i.e. all the bits that are set in mask_ are also set in mask)


filtered_l = [tuple_ for tuple_ in l if count[tuple_to_mask(tuple_)] == 1]
print(filtered_l)
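
Running this on the example list prints:

[('A', 'B', 'C', ''), ('A', '', '', 'D')]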

I'm not sure whether this is the most efficient or most Pythonic way, but this is the straightforward approach (maybe others will come up with a more sophisticated list-comprehension method):

Take a look at this:

l = [('A','B','',''),('A','B','C',''),('','','','D'),('A','','','D'),('','B','','')]

def item_in_list(item, l):
    # return True if some other tuple in l contains item,
    # i.e. matches item at every non-empty position
    for item2comp in l:
        if item != item2comp:
            found = True
            for part, rhs_part in zip(item, item2comp):
                # a non-empty part must match the corresponding part
                if part != '' and part != rhs_part:
                    found = False
                    break
            if found:
                return True
    return False

new_arr = []
for item in l:
    if not item_in_list(item, l):
        new_arr.append(item)
print(new_arr)

output:

[('A', 'B', 'C', ''), ('A', '', '', 'D')]

Time complexity as I see it is O(N^2 * M), where:

N - the number of elements in the list

M - the number of parts in each element

import collections

L = [('A','B','',''),('A','B','C',''),('','','','D'),('A','','','D'),('','B','','')]
keys = collections.defaultdict(lambda: collections.defaultdict(set))

# maintain a record of tuple-indices that contain each character in each position
for i,t in enumerate(L):
    for c,e in enumerate(t):
        if not e: continue
        keys[e][c].add(i)

delme = set()
for i,t in enumerate(L):
    # indices of every tuple that contains all of t's non-empty
    # characters at the same positions (always includes i itself)
    collocs = set.intersection(*[keys[e][c] for c,e in enumerate(t) if e])
    if len(collocs) > 1:
        # some other tuple contains t, so t is a partial duplicate
        delme.add(i)

filtered = [t for i,t in enumerate(L) if i not in delme]
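
With the example list, this keeps only the two maximal tuples:

print(filtered)
# [('A', 'B', 'C', ''), ('A', '', '', 'D')]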

The strings are always at the same positions, so I replaced them with boolean values in order to compare them more easily. First I sort by the number of non-empty entries, in decreasing order; then I keep an element only if no already-kept element is true (or equal) at every one of its positions. Any element covered by a kept one is dropped.

# replace each string with a boolean mask (True = non-empty)
masks = [tuple(map(bool, x)) for x in l]
# visit the tuples from most to fewest non-empty entries
order = sorted(range(len(l)), key=lambda i: sum(masks[i]), reverse=True)

to_keep = []
for i in order:
    # keep i unless an already-kept tuple is non-empty everywhere i is
    if not any(all(x or not y for x, y in zip(masks[j], masks[i])) for j in to_keep):
        to_keep.append(i)

print([l[i] for i in sorted(to_keep)])
[('A', 'B', 'C', ''), ('A', '', '', 'D')]

At 43.7 µs, it's also twice as fast as the top-voted answer.

Consider each sequence a set. Now we simply discard all subsets.

Given

import itertools as it


expected = {("A", "B", "C", ""), ("A", "", "", "D")}
data = [
    ("A", "B", "", ""),
    ("A", "B", "C", ""), 
    ("", "", "", "D"), 
    ("A", "", "", "D"), 
    ("", "B", "", "")
]

Code

An iterative solution that converts and compares sets.

def discard_subsets(pool: list) -> set:
    """Return a set without subsets."""
    discarded = set()

    for n, k in it.product(pool, repeat=2):                 # 1

        if set(k) < set(n):                                 # 2
            discarded.add(k)

    return set(pool) - discarded                            # 3

A similar one-line solution

set(data) - {k for n, k in it.product(data, repeat=2) if set(k) < set(n)}

Demo

discard_subsets(data)
# {('A', '', '', 'D'), ('A', 'B', 'C', '')}

Details

The discard_subsets function is annotated to help explain each part:

  1. Compare all elements with each other. (Or use nested loops).
  2. If an element is a proper subset (see below), discard it.
  3. Remove discarded elements from the pool.

Why use sets?

Each element of the pool can be a set since the pertinent sub-elements are unique, i.e. "A", "B", "C", "D", "".
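
For example, converting a tuple to a set collapses the repeated empty strings into a single element, which is why positional information can safely be dropped here:

assert set(("A", "B", "", "")) == {"A", "B", ""}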

Sets have membership properties. So saying, as an example,

("A", "B", "", "") has all values in ("A", "B", "C", "")

can also be stated

the set {"A", "B", "", ""} is a subset of {"A", "B", "C", ""}

All that remains is to compare all elements and reject all proper subsets.

a, a_, ac = {"a"}, {"a"}, {"a", "c"}

# Subsets
assert a.issubset(a_)                                       
assert a <= a_
assert a <= ac

# Proper subsets
assert not a < a_
assert a < ac

Complexity

Since we compare all pairs with nested loops, we get O(n^2) complexity. It may not be the most efficient approach, but it should hopefully be clear enough to follow.

Tests

f = discard_subsets
assert {("A", "B", "C", "")} == f([("A", "B", "", ""), ("A", "B", "C", "")])
assert {("A", "B", "C", "")} == f([("A", "B", "C", ""), ("A", "B", "", "")])
assert {("A", "B", "C", ""), ("", "", "", "D")} == f([("A", "B", "", ""), ("A", "B", "C", ""), ("", "", "", "D")])
assert {("A", "B", "C", ""), ("", "", "", "D")} == f([("", "", "", "D"), ("A", "B", "", ""), ("A", "B", "C", "")])
assert {("A", "B", "C", ""), ("", "", "", "D")} == f([("A", "B", "C", ""), ("", "", "", "D"), ("A", "B", "", "")])
assert {("A", "B", "C", ""), ("", "", "", "D")} == f([("A", "B", "C", ""), ("A", "B", "", ""), ("", "", "", "D")])
assert {("A","","C"), ("","B","C"), ("A","B","")} == f([("A","","C"),("","B","C"),("","","C"),("A","",""),("","",""),("A","B",""),("","B","")])
assert set(expected) == f(data)
