
Remove duplicates in a list of objects with Python

I've got a list of objects and a database table full of records. My list of objects has a title attribute, and I want to remove any objects with duplicate titles from the list (leaving the original).

Then I want to check whether my list of objects contains duplicates of any records in the database and, if so, remove those items from the list before adding the rest to the database.

I have seen solutions for removing duplicates from a list like this: myList = list(set(myList)), but I'm not sure how to do that with a list of objects.

I also need to maintain the order of my list of objects. I was thinking that maybe I could use difflib to check for differences in the titles.

set(list_of_objects) will only remove duplicates if you know what a duplicate is; that is, you'll need to define the uniqueness of an object.

In order to do that, you'll need to make the object hashable. You need to define both the __hash__ and __eq__ methods; here is how:

http://docs.python.org/glossary.html#term-hashable

Depending on how you use the objects, though, you may only need to define the __eq__ method. (Note that in Python 3, a class that defines __eq__ without __hash__ becomes unhashable, so for use with set() you need both.)

EDIT: How to implement the __eq__ method:

As I mentioned, you'll need to know the uniqueness definition of your object. Suppose we have a Book with attributes author_name and title whose combination is unique (so we can have many books authored by Stephen King, and many books named The Shining, but only one book named The Shining by Stephen King); then the implementation is as follows:

def __eq__(self, other):
    # Two books are equal when both author and title match
    return (self.author_name == other.author_name
            and self.title == other.title)

Similarly, this is how I sometimes implement the __hash__ method:

def __hash__(self):
    # Hash a tuple built from the same attributes that __eq__ compares
    return hash(('title', self.title,
                 'author_name', self.author_name))

You can check that if you create a list of 2 books with the same author and title, the book objects will compare equal (with the == operator). Also, when set() is used, it will keep only one of them. (An earlier version of this paragraph also claimed that the two objects would be the same under the is operator; that part was struck through, see the edit below.)

EDIT: This is an old answer of mine, but I only now noticed that it had an error, which is corrected with strikethrough in the paragraph above: objects with the same hash() won't give True when compared with is. Hashability of objects is needed, however, if you intend to use them as elements of a set or as keys in a dictionary.
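Putting the pieces together, here is a minimal, self-contained sketch (the Book class mirrors the example above; the attribute values are made up) that demonstrates both points: duplicates compare equal and are dropped by set(), but they are still distinct objects under is:

class Book:
    def __init__(self, author_name, title):
        self.author_name = author_name
        self.title = title

    def __eq__(self, other):
        # Two books are equal when author and title both match
        return (self.author_name == other.author_name
                and self.title == other.title)

    def __hash__(self):
        # Hash the same attributes that __eq__ compares
        return hash((self.author_name, self.title))

books = [Book('Stephen King', 'The Shining'),
         Book('Stephen King', 'The Shining'),
         Book('Stephen King', 'It')]

print(books[0] == books[1])   # True  (equal by value)
print(books[0] is books[1])   # False (still two distinct objects)
print(len(set(books)))        # 2     (set() drops one duplicate)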

Since the objects themselves aren't hashable, you can't use a set directly. The titles should be, though.

Here's the first part:

seen_titles = set()
new_list = []
for obj in myList:
    if obj.title not in seen_titles:
        new_list.append(obj)          # keep the first object with this title
        seen_titles.add(obj.title)

You're going to need to describe what database/ORM etc. you're using for the second part, though.
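The database half depends entirely on what you're using. As one possibility, a plain sqlite3 sketch (the file name, the table name books, and the column title are all assumptions) could build on new_list from the snippet above like this:

import sqlite3

# Hypothetical schema: a table "books" with a "title" column is assumed.
conn = sqlite3.connect("library.db")
cur = conn.cursor()

# Titles already present in the database
cur.execute("SELECT title FROM books")
existing_titles = {row[0] for row in cur.fetchall()}

# Keep only objects whose title is not already stored, then insert them
to_insert = [obj for obj in new_list if obj.title not in existing_titles]
cur.executemany("INSERT INTO books (title) VALUES (?)",
                [(obj.title,) for obj in to_insert])
conn.commit()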

This seems pretty minimal:

new_dict = dict()
for obj in myList:
    if obj.title not in new_dict:
        new_dict[obj.title] = obj   # the first object seen for each title wins
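If you then need the de-duplicated objects back as a list, note that dict preserves insertion order (guaranteed since Python 3.7), so the original ordering of first appearances survives:

new_list = list(new_dict.values())   # first-seen order is preserved on Python 3.7+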

Both __hash__ and __eq__ are needed for this.

__hash__ is needed to add an object to a set, since Python's sets are implemented as hash tables. By default, immutable objects like numbers, strings, and tuples are hashable.

However, hash collisions (two distinct objects hashing to the same value) are inevitable, due to the pigeonhole principle. So two objects cannot be distinguished by their hash alone, and the user must supply their own __eq__ function. Thus, the actual hash function the user provides is not crucial, though it is best to avoid hash collisions for performance (see What's a correct and good way to implement __hash__()?).
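As a small illustration of that point (a hypothetical class written just for this example), two objects can share a hash yet still be kept apart in a set, because __eq__ says they differ:

class Colliding:
    def __init__(self, name):
        self.name = name

    def __hash__(self):
        # Deliberately terrible hash: every instance collides
        return 1

    def __eq__(self, other):
        return self.name == other.name

items = {Colliding('a'), Colliding('b'), Colliding('a')}
print(len(items))   # 2 -- __eq__ still distinguishes 'a' from 'b' despite equal hashes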

I recently ended up using the code below. It is similar to other answers in that it iterates over the list and records what it has seen, then removes any item it has already seen; but it doesn't create a duplicate list, it just deletes the item from the original list.

seen = {}
for obj in objList[:]:                    # iterate over a copy, since objList is mutated below
    if obj["key-property"] in seen:
        objList.remove(obj)               # drop this later duplicate from the original list
    else:
        seen[obj["key-property"]] = 1
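An alternative (not the original poster's code) that avoids calling list.remove() inside the loop, which is O(n) per removal, is to build the filtered list and assign it back with a slice assignment so the original list object is still modified in place:

seen = set()
keep = []
for obj in objList:
    if obj["key-property"] not in seen:
        seen.add(obj["key-property"])
        keep.append(obj)
objList[:] = keep   # slice assignment mutates the original list object in place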

If you can't (or won't) define __eq__ for the objects, you can use a dict comprehension to achieve the same end:

unique = list({item.attribute:item for item in mylist}.values())

Note that this keeps the last instance of a given key, e.g. for mylist = [Item(attribute=1, tag='first'), Item(attribute=1, tag='second'), Item(attribute=2, tag='third')] you get [Item(attribute=1, tag='second'), Item(attribute=2, tag='third')]. You can get around this by iterating over mylist[::-1] instead (if the full list is available up front).
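For example, with the same hypothetical Item objects as above, keeping the first instance of each key instead of the last could look like this:

# Build the dict from the reversed list so earlier items overwrite later ones,
# then reverse the result to restore the original order of first appearances.
unique_first = list({item.attribute: item for item in mylist[::-1]}.values())[::-1]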

For non-hashable types you can use a dictionary comprehension to remove duplicate objects based on a field shared by all the objects. This is particularly useful for Pydantic, which doesn't support hashable types by default:

{ row.title : row for row in rows }.values()

Note that this considers duplicates solely based on row.title, and it takes the last matched object for each row.title. This means that if your rows may have the same title but different values in other attributes, this won't work.

e.g. [{"title": "test", "myval": 1}, {"title": "test", "myval": 2}] ==> [{"title": "test", "myval": 2}]

If you want to match against multiple fields in row, you can extend this further:

{ f"{row.title}\0{row.value}" : row for row in rows }.values()

The null character \0 is used as a separator between fields. This assumes that the null character isn't used in either row.title or row.value.
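If you'd rather not rely on that assumption, a tuple key (assuming both fields are hashable, as in the example above) sidesteps the separator entirely:

{ (row.title, row.value): row for row in rows }.values()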

If you want to preserve the original order, use this:

seen = {}
# setdefault stores each value the first time it appears and returns it unchanged
new_list = [seen.setdefault(x, x) for x in my_list if x not in seen]
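On Python 3.7+ the same order-preserving de-duplication can also be written with dict.fromkeys (an alternative, not part of the original answer):

new_list = list(dict.fromkeys(my_list))   # keeps the first occurrence of each value, in order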

If you don't care about ordering, then use this:

new_list = list(set(my_list))

It's quite easy, friends:

a = [5, 6, 7, 32, 32, 32, 32, 32, 32, 32, 32]
a = list(set(a))
print(a)
# e.g. [5, 6, 7, 32] (the exact order is not guaranteed, since sets are unordered)

That's it! :)
