
Remove duplicates from a list of tuples of different length with python

I extract specific names from text using regex etc. The result is a list of tuples containing titles and names. The tuples might be of different lengths. lst below shows a list of possible scenarios. I need to remove duplicate names from the result. For example, ('Lord', 'Justice') should count as the same name as ('Lord', 'Justice', 'Smith'), and ('Lady', 'Smiles') as the same name as ('Lady', 'Justice', 'Smiles'), but ('Lord', 'Justice', 'Smith') and ('Lady', 'Justice', 'Smiles') are different names. The desired output for each element in lst should be [('Lord', 'Justice', 'Smith'), ('Lady', 'Justice', 'Smiles')].

lst = [[('Lord', 'Justice', 'Smith'), ('Lady', 'Justice', 'Smiles')],
       [('Lord', 'Justice', 'Smith'), ('Lady', 'Justice', 'Smiles'), ('Lord', 'Justice')],
       [('Lord', 'Justice', 'Smith'), ('Lady', 'Smiles'), ('Lady', 'Justice', 'Smiles')],
       [('Lord', 'Justice', 'Smith'), ('Lady', 'Justice'), ('Lady', 'Justice', 'Smiles')],
       [('Lord', 'Justice', 'Smith'), ('Lady', 'Justice', 'Smiles'), ('Lady', 'Smiles')]]

This is what I have right now, but it doesn't yield the desired output. I would really appreciate your help and suggestions.

for l in lst:
    print(l)
    # remove duplicates based on the last index in tuples
    lst_1 = list(dict((v[-1],v) for v in sorted(l, key=lambda l: lst[0])).values())
    print(lst_1)
    # remove duplicates based on the second index [1] in tuples
    lst_2 = list(dict((v[1],v) for v in sorted(lst_1, key=lambda lst_1: lst_1[0])).values())    
    print(lst_2)
    print("\n")

UPDATE:

I was probably too specific in my examples. I had to include other names, so the solution should also work when other names are present:

lst = [
[('Lord', 'Justice', 'Smith'), ('Lady', 'Justice', 'Smiles'), ('Lord', 'Other'), ('Lady', 'Another'), ('Lady', 'Diana', 'Spencer'), ('Lord', 'Dave', 'Castle')],
[('Lord', 'Justice', 'Smith'), ('Lady', 'Justice', 'Smiles'), ('Lord', 'Justice'), ('Lord', 'Other'), ('Lady', 'Another'), ('Lady', 'Diana', 'Spencer'), ('Lord', 'Dave', 'Castle')],
[('Lord', 'Justice', 'Smith'), ('Lady', 'Smiles'), ('Lady', 'Justice', 'Smiles'), ('Lord', 'Other'), ('Lady', 'Another'), ('Lady', 'Diana', 'Spencer'), ('Lord', 'Dave', 'Castle')],
[('Lord', 'Justice', 'Smith'), ('Lady', 'Justice'), ('Lady', 'Justice', 'Smiles'), ('Lord', 'Other'), ('Lady', 'Another'), ('Lady', 'Diana', 'Spencer'), ('Lord', 'Dave', 'Castle')],
[('Lord', 'Justice', 'Smith'), ('Lady', 'Justice', 'Smiles'), ('Lady', 'Smiles'), ('Lord', 'Other'), ('Lady', 'Another'), ('Lady', 'Diana', 'Spencer'), ('Lord', 'Dave', 'Castle')]
]

Desired output:

[('Lord', 'Justice', 'Smith'), ('Lady', 'Justice', 'Smiles'), ('Lord', 'Other'), ('Lady', 'Another'), ('Lady', 'Diana', 'Spencer'), ('Lord', 'Dave', 'Castle')]
[('Lord', 'Justice', 'Smith'), ('Lady', 'Justice', 'Smiles'), ('Lord', 'Other'), ('Lady', 'Another'), ('Lady', 'Diana', 'Spencer'), ('Lord', 'Dave', 'Castle')]
[('Lord', 'Justice', 'Smith'), ('Lady', 'Justice', 'Smiles'), ('Lord', 'Other'), ('Lady', 'Another'), ('Lady', 'Diana', 'Spencer'), ('Lord', 'Dave', 'Castle')]
[('Lord', 'Justice', 'Smith'), ('Lady', 'Justice', 'Smiles'), ('Lord', 'Other'), ('Lady', 'Another'), ('Lady', 'Diana', 'Spencer'), ('Lord', 'Dave', 'Castle')]
[('Lord', 'Justice', 'Smith'), ('Lady', 'Justice', 'Smiles'), ('Lord', 'Other'), ('Lady', 'Another'), ('Lady', 'Diana', 'Spencer'), ('Lord', 'Dave', 'Castle')]

I came up with this solution:

from itertools import groupby

lst = [
[('Lord', 'Justice', 'Smith'), ('Lady', 'Justice', 'Smiles')],
[('Lord', 'Justice', 'Smith'), ('Lady', 'Justice', 'Smiles'), ('Lord', 'Justice')],
[('Lord', 'Justice', 'Smith'), ('Lady', 'Smiles'), ('Lady', 'Justice', 'Smiles')],
[('Lord', 'Justice', 'Smith'), ('Lady', 'Justice'), ('Lady', 'Justice', 'Smiles')],
[('Lord', 'Justice', 'Smith'), ('Lady', 'Justice', 'Smiles'), ('Lady', 'Smiles')]
]

def remove_duplicates(lst):
    rv = []
    for g, v in groupby([g for g, _ in groupby(sorted(lst))], key=lambda v: v[0]):
        rv.append(max(list(v), key=lambda v: len(v)))
    return rv


for option in lst:
    print(remove_duplicates(option))

Output:

[('Lady', 'Justice', 'Smiles'), ('Lord', 'Justice', 'Smith')]
[('Lady', 'Justice', 'Smiles'), ('Lord', 'Justice', 'Smith')]
[('Lady', 'Justice', 'Smiles'), ('Lord', 'Justice', 'Smith')]
[('Lady', 'Justice', 'Smiles'), ('Lord', 'Justice', 'Smith')]
[('Lady', 'Justice', 'Smiles'), ('Lord', 'Justice', 'Smith')]

You can do this using itertools.groupby, which groups consecutive tuples that share the same title:

from itertools import groupby

lst = [
[('Lord', 'Justice', 'Smith'), ('Lady', 'Justice', 'Smiles'), ('Lord', 'Other'), ('Lady', 'Another'), ('Lady', 'Diana', 'Spencer'), ('Lord', 'Dave', 'Castle')],
[('Lord', 'Justice', 'Smith'), ('Lady', 'Justice', 'Smiles'), ('Lord', 'Justice'), ('Lord', 'Other'), ('Lady', 'Another'), ('Lady', 'Diana', 'Spencer'), ('Lord', 'Dave', 'Castle')],
[('Lord', 'Justice', 'Smith'), ('Lady', 'Smiles'), ('Lady', 'Justice', 'Smiles'), ('Lord', 'Other'), ('Lady', 'Another'), ('Lady', 'Diana', 'Spencer'), ('Lord', 'Dave', 'Castle')],
[('Lord', 'Justice', 'Smith'), ('Lady', 'Justice'), ('Lady', 'Justice', 'Smiles'), ('Lord', 'Other'), ('Lady', 'Another'), ('Lady', 'Diana', 'Spencer'), ('Lord', 'Dave', 'Castle')],
[('Lord', 'Justice', 'Smith'), ('Lady', 'Justice', 'Smiles'), ('Lady', 'Smiles'), ('Lord', 'Other'), ('Lady', 'Another'), ('Lady', 'Diana', 'Spencer'), ('Lord', 'Dave', 'Castle')]
]
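# groupby only merges consecutive tuples that share a title (x[0]).
# From each such run keep the longest tuple; on a tie the last one wins.
res = [[max(reversed(list(v)), key=len) for k,v in groupby(sl, lambda x: x[0])] for sl in lst]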
for l in res:
    print(l)

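Output (note that ('Lady', 'Another') is dropped, because it sits in the same run of 'Lady' tuples as the longer ('Lady', 'Diana', 'Spencer')):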

[('Lord', 'Justice', 'Smith'), ('Lady', 'Justice', 'Smiles'), ('Lord', 'Other'), ('Lady', 'Diana', 'Spencer'), ('Lord', 'Dave', 'Castle')]
[('Lord', 'Justice', 'Smith'), ('Lady', 'Justice', 'Smiles'), ('Lord', 'Other'), ('Lady', 'Diana', 'Spencer'), ('Lord', 'Dave', 'Castle')]
[('Lord', 'Justice', 'Smith'), ('Lady', 'Justice', 'Smiles'), ('Lord', 'Other'), ('Lady', 'Diana', 'Spencer'), ('Lord', 'Dave', 'Castle')]
[('Lord', 'Justice', 'Smith'), ('Lady', 'Justice', 'Smiles'), ('Lord', 'Other'), ('Lady', 'Diana', 'Spencer'), ('Lord', 'Dave', 'Castle')]
[('Lord', 'Justice', 'Smith'), ('Lady', 'Justice', 'Smiles'), ('Lord', 'Other'), ('Lady', 'Diana', 'Spencer'), ('Lord', 'Dave', 'Castle')]
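
If the goal is the desired output from the update, with every distinct name kept, one possible alternative is to drop a tuple only when a longer tuple in the same list contains all of its items in the same order. This is only a sketch: the "in-order subsequence" rule is my reading of the examples in the question, and the names is_subsequence and drop_contained_names are made up for illustration.

```python
def is_subsequence(short, long):
    """Return True if the items of `short` appear in `long` in the same order."""
    remaining = iter(long)
    # `item in remaining` advances the iterator, so order is respected.
    return all(item in remaining for item in short)


def drop_contained_names(names):
    """Keep each tuple unless a longer tuple in `names` contains it in order."""
    kept = []
    for name in names:
        if name in kept:
            continue  # exact repeat of a name we already kept
        contained = any(
            len(other) > len(name) and is_subsequence(name, other)
            for other in names
        )
        if not contained:
            kept.append(name)
    return kept


sample = [('Lord', 'Justice', 'Smith'), ('Lady', 'Smiles'),
          ('Lady', 'Justice', 'Smiles'), ('Lord', 'Other'),
          ('Lady', 'Another'), ('Lady', 'Diana', 'Spencer')]
print(drop_contained_names(sample))
# [('Lord', 'Justice', 'Smith'), ('Lady', 'Justice', 'Smiles'), ('Lord', 'Other'), ('Lady', 'Another'), ('Lady', 'Diana', 'Spencer')]
```

This keeps the original order and compares every tuple against every other one, so it is quadratic in the list length, which should be fine for short lists of extracted names. It also cannot tell which full name a short tuple belongs to when several longer names contain it; in that case the short tuple is simply dropped.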
