
Remove duplicates from a tuple of tuples

input_tuple = (
    (12805,'MLB','NAME1','body NAME1 Noah dev'),         
    (12805,'MLB','NAME2','body NAME2 Noah dev'),
    (12805,'MLB','NAME3','body NAME3 Elijah'),

    (12806,'MLB','NAME4','body NAME4 Liam sev'),
    (12806,'MLB','NAME5','body NAME5 Noah dev'),

    (12807,'MLB','NAME6','body NAME6 Liam sev'),
    (12807,'MLB','NAME7','body NAME7 epic peterson'),

    (12808,'MLB','NAME8','body NAME8 Liam sev'),
    (12808,'MLB','NAME9','body NAME9 epic peterson')         

    )

In input_tuple, the first element of each row (the number) is the key. Duplicates must be detected on the element at index [3], ignoring its first two words.
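
As a small illustration of that comparison rule, the sketch below builds the value that two rows would be compared on. The helper name `body_key` is made up for this example and is not part of the original code.

```python
def body_key(row):
    """Return (id, body without its first two words) for one row."""
    row_id, _category, _name, body = row
    # split() with no argument also collapses repeated whitespace
    return row_id, " ".join(body.split()[2:])


row_a = (12805, 'MLB', 'NAME1', 'body NAME1 Noah dev')
row_b = (12805, 'MLB', 'NAME2', 'body NAME2 Noah dev')

# Both rows share the same id and the same trimmed body,
# so they count as duplicates of each other.
print(body_key(row_a) == body_key(row_b))
```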

There are three scenarios to handle:

1) Self duplicate check: if a duplicate is found at index [3] within the same key, remove that row.

2) Across duplicate: once self duplicates have been removed, check for duplicates across different keys and, if any are found, replace them with the first occurrence.

3) Across duplicate in pairs: here I also want to check across keys, but a group should only be removed if its whole set of rows duplicates another group. I have added such a group at the end of input_tuple, e.g.:

    (12808,'MLB','NAME8','body NAME8 Liam sev'),
    (12808,'MLB','NAME9','body NAME9 epic peterson') 

These rows duplicate the following ones:

(12807,'MLB','NAME6','body NAME6 Liam sev'),
(12807,'MLB','NAME7','body NAME7 epic peterson'),

Hence, the 12808 rows should also be eliminated.

output_tuple = (
    (12805,'MLB','NAME1','body NAME1 Noah dev'),
    (12805,'MLB','NAME3','body NAME3 Elijah'),

    (12806,'MLB','NAME4','body NAME4 Liam sev'),
    (12806,'MLB','NAME1','body NAME1 Noah dev'),

    (12807,'MLB','NAME4','body NAME4 Liam sev'),
    (12807,'MLB','NAME7','body NAME7 epic peterson')
    )

Code I tried (it works for the first scenario):

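def skip_two_words(text):
    # Drop the first two space-separated words of the body
    return ' '.join(text.split(' ')[2:])

# Keep the first row for each (id, trimmed body) pair, preserving order
input_tuple = tuple({(x[0], skip_two_words(x[3])): x
                     for x in input_tuple[::-1]}.values())[::-1]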

from collections import defaultdict

# Collect the body strings for each id
id_name_dict = defaultdict(list)
for _id, _, _, name in input_tuple:
    id_name_dict[_id].append(name)

seen = set()
ignore_id_set = set()
for _id, _namelst in id_name_dict.items():
    key = tuple(sorted(_namelst))  # full body strings, NAME token included
    if key not in seen:
        seen.add(key)
    else:
        ignore_id_set.add(_id)  # duplicate
del id_name_dict, seen  # id_name_dict, seen are now eligible for garbage collection

output_tuple = tuple(item for item in input_tuple if item[0] not in ignore_id_set)

There is no need to reinvent the wheel to drop duplicates. The itertools docs have a unique_everseen recipe, which is also available in third-party libraries as more_itertools.unique_everseen or toolz.unique. The second part is a bit messier, but you can use a custom function to define your splits and then apply it to each row with a generator expression wrapped in tuple().
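
If installing a third-party package is not an option, the same order-preserving idea can be sketched with the standard library alone. The function below is a simplified version written for this example rather than the exact text of the itertools recipe, and the `rows` sample is copied from the question.

```python
def unique_everseen(iterable, key=None):
    """Yield elements in order, skipping any whose key was already seen."""
    seen = set()
    for element in iterable:
        k = element if key is None else key(element)
        if k not in seen:
            seen.add(k)
            yield element


rows = (
    (12805, 'MLB', 'NAME1', 'body NAME1 Noah dev'),
    (12805, 'MLB', 'NAME2', 'body NAME2 Noah dev'),
    (12805, 'MLB', 'NAME3', 'body NAME3 Elijah'),
)

# Key on (id, body words after the first two); the tuple keeps the key hashable
deduped = tuple(
    unique_everseen(rows, key=lambda r: (r[0], tuple(r[3].split()[2:])))
)
print(deduped)
```

Here the NAME2 row is skipped because its key matches the NAME1 row that came before it.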

from toolz import unique

# drop duplicates
res = tuple(unique(input_tuple, key=lambda x: (x[0], tuple(x[-1].split()[2:]))))

# make mapping dictionary
d = {' '.join(tup[-1].split()[2:]): tup[-2] for tup in reversed(input_tuple)}

# apply dictionary mapping with some splits
def return_tup(tup, d):
    num, cat, name_id, full = tup
    full_split = full.split()
    name_words = ' '.join(full_split[2:])
    name_id = d[name_words]
    full = ' '.join([full_split[0], name_id, name_words])
    return num, cat, name_id, full

res = tuple(return_tup(tup, d) for tup in res)

((12805, 'MLB', 'NAME1', 'body NAME1 Noah dev'),
 (12805, 'MLB', 'NAME3', 'body NAME3 Elijah'),
 (12806, 'MLB', 'NAME4', 'body NAME4 Liam sev'),
 (12806, 'MLB', 'NAME1', 'body NAME1 Noah dev'),
 (12807, 'MLB', 'NAME4', 'body NAME4 Liam sev'),
 (12807, 'MLB', 'NAME7', 'body NAME7 epic peterson'))
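
The two steps above cover scenarios 1 and 2. If the 12808 rows from scenario 3 are present in input_tuple, those steps would keep them, so an extra pass is needed. The sketch below is one possible reading of "pair duplicate": an id group is dropped when its full set of trimmed bodies matches that of an earlier group. The helper name `drop_duplicate_groups` and this interpretation are assumptions, not something the original answer provides.

```python
from collections import defaultdict


def drop_duplicate_groups(rows):
    """Drop every id whose set of trimmed bodies repeats an earlier id's set."""
    bodies_by_id = defaultdict(list)
    for row_id, _category, _name, body in rows:
        # Compare on the body without its first two words
        bodies_by_id[row_id].append(" ".join(body.split()[2:]))

    seen = set()
    duplicate_ids = set()
    # dicts keep insertion order, so the first group with a signature wins
    for row_id, bodies in bodies_by_id.items():
        signature = tuple(sorted(bodies))
        if signature in seen:
            duplicate_ids.add(row_id)
        else:
            seen.add(signature)

    return tuple(row for row in rows if row[0] not in duplicate_ids)


rows = (
    (12806, 'MLB', 'NAME4', 'body NAME4 Liam sev'),
    (12806, 'MLB', 'NAME1', 'body NAME1 Noah dev'),
    (12807, 'MLB', 'NAME4', 'body NAME4 Liam sev'),
    (12807, 'MLB', 'NAME7', 'body NAME7 epic peterson'),
    (12808, 'MLB', 'NAME4', 'body NAME4 Liam sev'),
    (12808, 'MLB', 'NAME7', 'body NAME7 epic peterson'),
)
result = drop_duplicate_groups(rows)
print(result)
```

Because the comparison uses the trimmed body, this pass can run either before or after the NAME remapping step. Group 12806 survives because only one of its two rows overlaps with another group.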
