简体   繁体   中英

Find duplicates in a list of tuples

You are given information about users of your website. The information includes username, a phone number and/or an email. Write a program that takes in a list of tuples where each tuple represents information for a particular user and returns a list of lists where each sublist contains the indices of tuples containing information about the same person. For example:

Input:
[("MLGuy42", "andrew@example.com", "123-4567"),
("CS229DungeonMaster", "123-4567", "ml@example.net"),
("Doomguy", "john@example.org", "carmack@example.com"),
("andrew26", "andrew@example.com", "mlguy@example.com")]

Output:
[[0, 1, 3], [2]]

Since "MLGuy42", "CS229DungeonMaster" and "andrew26" are all the same person.

Each sublist in the output should be sorted and the outer list should be sorted by the first element in the sublist.


Below is the code snippet that I did for this problem. It seems to work fine, but I'm wondering if there is a better/optimized solution. Any help would be appreciated. Thanks!

def find_duplicates(user_info):
    results = list()
    seen = dict()
    for i, user in enumerate(user_info):
        first_seen = True
        key_info = None
        for info in user:
            if info in seen:
                first_seen = False
                key_info = info
                break
        if first_seen:
            results.append([i])
            pos = len(results) - 1
        else:
            index = seen[key_info]
            results[index].append(i)
            pos = index
        for info in user:
            seen[info] = pos
    return results

I think I've reached to an optimized working solution using a graph. Basically, I've created a graph with each node contains its user information and its index. Then, use dfs to traverse the graph and find the duplicates.

I think we can simplify this using sets:

from random import shuffle

def find_duplicates(user_info):

    reduced = unreduced = {frozenset(info): [i] for i, info in enumerate(user_info)}

    while reduced is unreduced or len(unreduced) > len(reduced):

        unreduced = dict(reduced)  # make a copy

        for identifiers_1, positions_1 in unreduced.items():

            for identifiers_2, positions_2 in unreduced.items():

                if identifiers_1 is identifiers_2:
                    continue

                if identifiers_1 & identifiers_2:
                    del reduced[identifiers_1], reduced[identifiers_2]
                    reduced[identifiers_1 | identifiers_2] = positions_1 + positions_2
                    break
            else:  # no break
                continue

            break

    return sorted(sorted(value) for value in reduced.values())

my_input = [ \
    ("CS229DungeonMaster", "123-4567", "ml@example.net"), \
    ("Doomguy", "john@example.org", "carmack@example.com"), \
    ("andrew26", "andrew@example.com", "mlguy@example.com"), \
    ("MLGuy42", "andrew@example.com", "123-4567"), \
]

shuffle(my_input)  # shuffle to prove order independence

print(my_input)
print(find_duplicates(my_input))

OUTPUT

> python3 test.py
[('CS229DungeonMaster', '123-4567', 'ml@example.net'), ('MLGuy42', 'andrew@example.com', '123-4567'), ('andrew26', 'andrew@example.com', 'mlguy@example.com'), ('Doomguy', 'john@example.org', 'carmack@example.com')]
[[0, 1, 2], [3]]
> 

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM