
Remove duplicates from a tuple of tuples

input_tuple = (
    (12805,'MLB','NAME1','body NAME1 Noah dev'),         
    (12805,'MLB','NAME2','body NAME2 Noah dev'),
    (12805,'MLB','NAME3','body NAME3 Elijah'),

    (12806,'MLB','NAME4','body NAME4 Liam sev'),
    (12806,'MLB','NAME5','body NAME5 Noah dev'),

    (12807,'MLB','NAME6','body NAME6 Liam sev'),
    (12807,'MLB','NAME7','body NAME7 epic peterson'),

    (12808,'MLB','NAME8','body NAME8 Liam sev'),
    (12808,'MLB','NAME9','body NAME9 epic peterson')         

    )

In input_tuple, the first element of each row is the key. Duplicates must be detected on the string at index [3], ignoring its first two words.

There are three scenarios to handle:

1) Self duplicate check: if a duplicate is found at index [3] within the same key, that row must be removed.

2) Across duplicate: after the self duplicate check, look for duplicates across keys; if one is found, replace it with the first occurrence.

3) Across duplicate in pair: here the across check should apply only when a whole pair is duplicated across keys. I added these rows at the end of input_tuple as an example:

    (12808,'MLB','NAME8','body NAME8 Liam sev'),
    (12808,'MLB','NAME9','body NAME9 epic peterson') 

Since this pair duplicates the one below:

(12807,'MLB','NAME6','body NAME6 Liam sev'),
(12807,'MLB','NAME7','body NAME7 epic peterson'),

it should also be eliminated.

output_tuple = (
    (12805,'MLB','NAME1','body NAME1 Noah dev'),
    (12805,'MLB','NAME3','body NAME3 Elijah'),

    (12806,'MLB','NAME4','body NAME4 Liam sev'),
    (12806,'MLB','NAME1','body NAME1 Noah dev'),

    (12807,'MLB','NAME4','body NAME4 Liam sev'),
    (12807,'MLB','NAME7','body NAME7 epic peterson')
    )

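Code I tried (it works for the 1st scenario):

def skip_two_words(text):
    # Drop the first two space-separated words
    return ' '.join(text.split(' ')[2:])

# Reversing the input lets earlier rows overwrite later ones in the dict,
# so the first row for each (id, remaining words) key is the one kept.
input_tuple = tuple({(x[0], skip_two_words(x[3])): x
                     for x in input_tuple[::-1]}.values())[::-1]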

from collections import defaultdict

# Collect the index [3] strings of each id
id_name_dict = defaultdict(list)
for id_, _, _, name in input_tuple:
    id_name_dict[id_].append(name)

seen = set()
ignore_id_set = set()
for id_, names in id_name_dict.items():
    key = tuple(sorted(names))
    if key not in seen:
        seen.add(key)
    else:
        ignore_id_set.add(id_)  # duplicate

del id_name_dict, seen  # no longer needed
output_tuple = tuple(item for item in input_tuple
                     if item[0] not in ignore_id_set)

There is no need to reinvent the wheel to drop duplicates. The itertools docs include a unique_everseen recipe, which is also available in third-party libraries as more_itertools.unique_everseen or toolz.unique. The second part is a bit messy, but you can use a custom function to define your splits and then build the result with a generator expression passed to tuple().

from toolz import unique

# drop duplicates
res = tuple(unique(input_tuple, key=lambda x: (x[0], tuple(x[-1].split()[2:]))))

# make mapping dictionary
d = {' '.join(tup[-1].split()[2:]): tup[-2] for tup in reversed(input_tuple)}

# apply dictionary mapping with some splits
def return_tup(tup, d):
    num, cat, name_id, full = tup
    full_split = full.split()
    name_words = ' '.join(full_split[2:])
    name_id = d[name_words]
    full = ' '.join([full_split[0], name_id, name_words])
    return num, cat, name_id, full

res = tuple(return_tup(tup, d) for tup in res)

((12805, 'MLB', 'NAME1', 'body NAME1 Noah dev'),
 (12805, 'MLB', 'NAME3', 'body NAME3 Elijah'),
 (12806, 'MLB', 'NAME4', 'body NAME4 Liam sev'),
 (12806, 'MLB', 'NAME1', 'body NAME1 Noah dev'),
 (12807, 'MLB', 'NAME4', 'body NAME4 Liam sev'),
 (12807, 'MLB', 'NAME7', 'body NAME7 epic peterson'))
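
The output above has no 12808 rows, while the code shown does not appear to apply the third scenario (pair duplicates across ids) by itself. Below is a minimal standard-library sketch of all three scenarios, offered as one possible reading of the question rather than a definitive solution. The names rows, tail, unique_everseen (a simplified version of the itertools recipe), tails_by_id, kept, first_name and rewrite are illustrative. It assumes that every index [3] string has at least two leading words to skip, and that an id is a "pair duplicate" when its whole set of remaining words already occurred under an earlier id.

```python
from collections import defaultdict

rows = (
    (12805, 'MLB', 'NAME1', 'body NAME1 Noah dev'),
    (12805, 'MLB', 'NAME2', 'body NAME2 Noah dev'),
    (12805, 'MLB', 'NAME3', 'body NAME3 Elijah'),
    (12806, 'MLB', 'NAME4', 'body NAME4 Liam sev'),
    (12806, 'MLB', 'NAME5', 'body NAME5 Noah dev'),
    (12807, 'MLB', 'NAME6', 'body NAME6 Liam sev'),
    (12807, 'MLB', 'NAME7', 'body NAME7 epic peterson'),
    (12808, 'MLB', 'NAME8', 'body NAME8 Liam sev'),
    (12808, 'MLB', 'NAME9', 'body NAME9 epic peterson'),
)

def tail(text):
    """Return text without its first two whitespace-separated words."""
    return ' '.join(text.split()[2:])

def unique_everseen(iterable, key):
    """Yield items whose key has not been seen yet, preserving order."""
    seen = set()
    for item in iterable:
        k = key(item)
        if k not in seen:
            seen.add(k)
            yield item

# Scenario 1: drop repeats of the same tail within one id.
deduped = tuple(unique_everseen(rows, key=lambda r: (r[0], tail(r[3]))))

# Scenario 3: drop every id whose full set of tails was already seen
# under an earlier id (assumed meaning of "pair duplicate").
tails_by_id = defaultdict(list)
for num, _, _, full in deduped:
    tails_by_id[num].append(tail(full))

seen_groups, dropped_ids = set(), set()
for num, tails in tails_by_id.items():  # dicts keep insertion order
    group = frozenset(tails)
    if group in seen_groups:
        dropped_ids.add(num)
    else:
        seen_groups.add(group)

kept = tuple(r for r in deduped if r[0] not in dropped_ids)

# Scenario 2: map each tail to the name of its first occurrence.
first_name = {}
for _, _, name, full in kept:
    first_name.setdefault(tail(full), name)

def rewrite(row):
    """Rebuild a row so it carries the first-occurrence name."""
    num, cat, _, full = row
    words = tail(full)
    name = first_name[words]
    return num, cat, name, ' '.join([full.split()[0], name, words])

result = tuple(rewrite(r) for r in kept)
```

The pair filter runs before the name mapping so that the mapping is built only from rows that survive. On this sample input either order should give the same rows, but that may not hold for other data.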
