
Remove duplicates (tuple of tuples)

Input data:

input_tuple = (
            (1, 'name1', 'Noah'),
            (1, 'name2', 'Liam'),

            (2, 'name3', 'Mason'),

            (3, 'name4', 'Mason'),

            (4, 'name5', 'Noah'),
            (4, 'name6', 'Liam'),

            (5, 'name7', 'Elijah'),
            (5, 'name8', 'Noah'),
            (5, 'name9', 'Liam')
          )

Converted into a dict (key, value):

input_tuple = {
         1: [['name1', 'Noah'], ['name2', 'Liam']],
         2: [['name3', 'Mason']],
         3: [['name4', 'Mason']],
         4: [['name5', 'Noah'], ['name6', 'Liam']],
         5: [['name7', 'Elijah'], ['name8', 'Noah'], 
             ['name9', 'Liam']]
         }

I filtered it a bit more, just to understand the data model:

    dict = {
        1: ['Noah', 'Liam'],
        2: ['Mason'],
        3: ['Mason'],
        4: ['Noah', 'Liam'],
        5: ['Elijah', 'Noah', 'Liam']
    }

Now I want to eliminate duplicates and then convert back to a tuple like the one below. Duplicate matching conditions:
1) only treat a value as a duplicate if len(value) > 1
2) values must match exactly, not partially.

Note: the values of keys 2 and 3 are not duplicates, because len(value) is not greater than 1. Key 4 is removed because its value is an exact duplicate of key 1's value. Since we are doing exact matching, key 5 stays even though its value contains 'Noah' and 'Liam'.

output_tuple = (
        (1, 'name1', 'Noah'),
        (1, 'name2', 'Liam'),

        (2, 'name3', 'Mason'),

        (3, 'name4', 'Mason'),

        (5, 'name7', 'Elijah'),
        (5, 'name8', 'Noah'),
        (5, 'name9', 'Liam')
      )

Code which I tried:

from functools import reduce
from collections import defaultdict

input_tuple_dictionary = defaultdict(list)
for (key, *value) in input_tuple:
    input_tuple_dictionary[key].append(value[1])

input_tuple_dictionary
for index in range(len(input_tuple_dictionary)-1):
    for key, value in input_tuple_dictionary.items():
        if len(value) > 1:
            if value == value[index+1]:
                print(key)
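# Using the dict format of yours
# (`dict` is the name-only mapping from the question; note that it shadows the built-in)
data = [set(dict[x]) for x in range(1, len(dict) + 1)]
input_data = input_tuple  # the original tuple of tuples
seen = []
output_tuple = []
for i in range(len(data)):
    # keep a group if it is new, or if it holds only a single name
    if (data[i] not in seen) or len(data[i]) == 1:
        for j in range(len(input_data)):
            if input_data[j][0] == i + 1:
                output_tuple.append(input_data[j])
    seen.append(data[i])
output_tuple = tuple(output_tuple)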

If you did not understand, please ask.

Good luck.

One common solution for skipping duplicates is to keep a set that contains all of the elements you've already seen. If the object has been seen before, you don't add it to the result.

The tricky bit is that the object you're trying to de-duplicate is the aggregation of multiple objects that reside within different tuples in your collection. Using groupby is an effective way to get those objects together in one convenient package.
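To illustrate what groupby produces before it is applied to the full problem, here is a minimal sketch on hypothetical sample rows (the names `rows`, `grouped`, `shuffled` and `regrouped` are made up for this illustration). It also shows one caveat: groupby only merges consecutive items with equal keys, so input that is not already ordered by key needs to be sorted first.

```python
from itertools import groupby

# Hypothetical sample rows, already ordered by their first element.
rows = ((1, 'a', 'Noah'), (1, 'b', 'Liam'), (2, 'c', 'Mason'))

# Each iteration yields the key and an iterator over the matching rows.
grouped = {key: [r[2] for r in group]
           for key, group in groupby(rows, key=lambda r: r[0])}
print(grouped)  # {1: ['Noah', 'Liam'], 2: ['Mason']}

# groupby only merges *consecutive* equal keys, so unordered input
# has to be sorted by the same key first (sorted() is stable).
shuffled = (rows[0], rows[2], rows[1])
regrouped = {key: [r[2] for r in group]
             for key, group in groupby(sorted(shuffled, key=lambda r: r[0]),
                                       key=lambda r: r[0])}
print(regrouped == grouped)  # True
```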

from itertools import groupby

input_tuple = (
    (1, 'name1', 'Noah'),
    (1, 'name2', 'Liam'),

    (2, 'name3', 'Mason'),

    (3, 'name4', 'Mason'),

    (4, 'name5', 'Noah'),
    (4, 'name6', 'Liam'),

    (5, 'name7', 'Elijah'),
    (5, 'name8', 'Noah'),
    (5, 'name9', 'Liam')
  )

seen = set()
result = []
for _, group in groupby(input_tuple, key=lambda t: t[0]):
    #convert from iterator to list, since we want to iterate this more than once
    group = list(group)
    #extract just the names from each tuple.
    names = tuple(t[2] for t in group)
    #check for duplicates, but only for name groups with more than one element.
    if len(names) == 1 or names not in seen:
        result.extend(group)
    seen.add(names)

print(result)

Result:

[(1, 'name1', 'Noah'), (1, 'name2', 'Liam'), (2, 'name3', 'Mason'), (3, 'name4', 'Mason'), (5, 'name7', 'Elijah'), (5, 'name8', 'Noah'), (5, 'name9', 'Liam')]

from collections import defaultdict

dct = defaultdict(list) 

for k,n_id,name in input_tuple:
    dct[k].append(name)

#print(dct)

seen = set()
ignore_id_set = set()

for _id, _namelst in dct.items():
    if len(_namelst) > 1:
        k = tuple(sorted(_namelst)) # note 1 

        if k not in seen:
            seen.add(k)
        else:
            ignore_id_set.add(_id) # duplicate

#print(seen)

# del dct,seen # dct,seen are now eligible for garbage collection

output = tuple(item for item in input_tuple if item[0] not in ignore_id_set)
print(output)


'''
note 1:
    Sorting matters **if** the same names can appear in a different order, e.g.
     (1, 'name1', 'Noah'),
     (1, 'name2', 'Liam'),

     (4, 'name6', 'Liam'),
     (4, 'name5', 'Noah'),

     Building dct from these rows gives

     1 : [Noah,Liam]
     4 : [Liam,Noah]

     To treat them as duplicates, sort the names before building the hashable key (the tuple).

**Otherwise** there is no need to sort.

'''
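The effect described in note 1 can be checked in isolation. The following is a small sketch rather than part of the answer's code, and the helper name `group_key` is hypothetical. It suggests that a sorted tuple makes the key order-insensitive while still telling repeated names apart, which a frozenset would not do.

```python
def group_key(names):
    """Build a hashable, order-insensitive key for a list of names."""
    return tuple(sorted(names))

a = ['Noah', 'Liam']
b = ['Liam', 'Noah']

# Without sorting, the two orderings produce different keys.
print(tuple(a) == tuple(b))          # False
# After sorting, they collapse to the same key.
print(group_key(a) == group_key(b))  # True

# Unlike frozenset, the sorted tuple keeps repeated names distinct.
print(frozenset(['Noah', 'Noah', 'Liam']) == frozenset(a))  # True
print(group_key(['Noah', 'Noah', 'Liam']) == group_key(a))  # False
```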

Here is one solution using a defaultdict of set objects and toolz.unique. toolz.unique is equivalent to the itertools unique_everseen recipe available in the docs.

The idea is to find keys with lone values and also keys which do not have duplicate values. The union of these two categories makes up your result.

from collections import defaultdict
from toolz import unique

dd = defaultdict(set)

for k, _, v in input_tuple:
    dd[k].add(v)

lones = {k for k, v in dd.items() if len(v) == 1}
uniques = set(unique(dd, key=lambda x: frozenset(dd[x])))

res = tuple(i for i in input_tuple if i[0] in lones | uniques)

Result:

print(res)

((1, 'name1', 'Noah'),
 (1, 'name2', 'Liam'),
 (2, 'name3', 'Mason'),
 (3, 'name4', 'Mason'),
 (5, 'name7', 'Elijah'),
 (5, 'name8', 'Noah'),
 (5, 'name9', 'Liam'))
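
If installing toolz is not an option, the same idea can be sketched with only the standard library. The `unique_everseen` function below is adapted from the recipe in the itertools documentation (simplified here to always take a key function), and the sample data is repeated so the block runs on its own. Treat it as an illustration under those assumptions rather than a drop-in replacement.

```python
from collections import defaultdict

def unique_everseen(iterable, key):
    """Yield items whose key(item) has not been seen before, in order."""
    seen = set()
    for item in iterable:
        k = key(item)
        if k not in seen:
            seen.add(k)
            yield item

input_tuple = (
    (1, 'name1', 'Noah'), (1, 'name2', 'Liam'),
    (2, 'name3', 'Mason'),
    (3, 'name4', 'Mason'),
    (4, 'name5', 'Noah'), (4, 'name6', 'Liam'),
    (5, 'name7', 'Elijah'), (5, 'name8', 'Noah'), (5, 'name9', 'Liam'),
)

dd = defaultdict(set)
for k, _, v in input_tuple:
    dd[k].add(v)

# Keys whose group holds a single name are always kept.
lones = {k for k, v in dd.items() if len(v) == 1}
# The first key seen for each distinct set of names.
uniques = set(unique_everseen(dd, key=lambda k: frozenset(dd[k])))

keep = lones | uniques
res = tuple(row for row in input_tuple if row[0] in keep)
print(sorted(keep))  # [1, 2, 3, 5]
```

Because the names are stored in sets, both their order and any repeated names within a group are ignored when groups are compared.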
