Most pythonic way of ensuring a list of objects contains only unique items

Question

I have a list of objects (Foo). A Foo object has several attributes. An instance of a Foo object is equivalent (equal) to another instance of a Foo object iff (if and only if) all the attributes are equal.

I have the following code:

class Foo(object):
    def __init__(self, myid):
        self.myid=myid

    def __eq__(self, other):
        if isinstance(other, self.__class__):
            print 'DEBUG: self:',self.__dict__ 
            print 'DEBUG: other:',other.__dict__ 
            return self.__dict__ == other.__dict__
        else:
            print 'DEBUG: ATTEMPT TO COMPARE DIFFERENT CLASSES:',self.__class__,'compared to:', other.__class__
            return False    


import copy

f1 = Foo(1)
f2 = Foo(2)
f3 = Foo(3)
f4 = Foo(4)
f5 = copy.deepcopy(f3) # overkill here (I know), but needed for my real code

f_list = [f1,f2,f3,f4,f5]

# Surely, there must be a better way? (this doesn't work BTW!)
new_foo_list = list(set(f_list))

I often used this little (anti?) 'pattern' above (converting to set and back), when dealing with simple types (int, float, string - and surprisingly datetime.datetime types), but it has come a cropper with the more involved data type - like Foo above.

So, how could I change the list f_list above into a list of unique items, without having to loop through each item and check whether it already exists in some temporary cache, etc.?

What is the most pythonic way to do this?

Answer 1

First, I want to emphasize that using set is certainly not an anti-pattern. Sets eliminate duplicates in O(n) time on average, which is the best you can do, and way better than the naive O(n^2) solution of comparing every item to every other item. It's even better than sorting -- and indeed, it seems your data structure might not even have a natural order, in which case sorting doesn't make a lot of sense.
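For reference, the naive quadratic approach mentioned above might look like the sketch below. It needs only __eq__, which is why it works on objects that cannot go into a set, but every membership test rescans the list. The Foo class here is a trimmed copy of the one in the question, and dedupe_quadratic is a hypothetical helper name.

```python
class Foo(object):
    """Trimmed version of the question's class: defines __eq__ only."""

    def __init__(self, myid):
        self.myid = myid

    def __eq__(self, other):
        return isinstance(other, self.__class__) and self.__dict__ == other.__dict__


def dedupe_quadratic(items):
    """Keep the first occurrence of each item, using only ==. (Hypothetical helper.)"""
    unique = []
    for item in items:
        # `in` on a list scans it linearly and calls __eq__,
        # so the whole loop is O(n^2) in the worst case.
        if item not in unique:
            unique.append(item)
    return unique


foos = [Foo(1), Foo(2), Foo(3), Foo(4), Foo(3)]
unique_foos = dedupe_quadratic(foos)
```

This keeps the original order, but for large lists a hash-based approach is the better choice.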

The problem with using a set in this case is that you have to define a custom __hash__ method. Others have said this. But whether or not you can do so easily is an open question -- it depends on details about your actual class that you haven't told us. For example, if any attributes of a Foo object above are not hashable, then creating a custom hash function is going to be difficult, because you'll have to not only write a custom hash for Foo objects, you'll also have to write custom hashes for every other type of object!

So you need to tell us more about what kinds of attributes your class has if you want a conclusive answer. But I can offer some speculation.

Assuming that a hash function could be written for Foo objects, but also assuming that Foo objects are mutable and so really shouldn't have a __hash__ method, as Niklas B. points out, here is one workable approach. Create a function freeze that, given a mutable instance of Foo, returns an immutable collection of the data in Foo. So for example, say Foo has a dict and a list in it; freeze returns a tuple containing a tuple of tuples (representing the dict) and another tuple (representing the list). The function freeze should have the following property:

freeze(a) == freeze(b)

If and only if

a == b

Now pass your list through the following code:

dupe_free = list(dict((freeze(x), x) for x in dupe_list).values())

Now you have a dupe free list in O(n) time. (Indeed, after adding this suggestion, I saw that fraxel suggested something similar; but I think using a custom function -- or even a method -- (x.freeze(), x) -- is the better way to go, rather than relying on __dict__ as he does, which can be unreliable. The same goes for your custom __eq__ method, IMO -- __dict__ is not always a safe shortcut for various reasons I can't get into here.)
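Here is a minimal sketch of what such a freeze function could look like. The attribute names settings and tags are hypothetical, chosen only to give Foo a dict and a list as in the description above, and the sketch assumes the dict keys are sortable and all stored values are hashable.

```python
class Foo(object):
    """Hypothetical mutable class holding a dict and a list."""

    def __init__(self, settings, tags):
        self.settings = settings  # dict; keys assumed sortable, values hashable
        self.tags = tags          # list of hashable values

    def __eq__(self, other):
        return (isinstance(other, self.__class__)
                and self.settings == other.settings
                and self.tags == other.tags)

    # Mutable, so deliberately unhashable.
    __hash__ = None


def freeze(foo):
    """Return an immutable, hashable snapshot of foo's data."""
    # Sort the dict items so that key order does not affect the result.
    frozen_settings = tuple(sorted(foo.settings.items()))
    frozen_tags = tuple(foo.tags)
    return (frozen_settings, frozen_tags)


dupe_list = [
    Foo({'a': 1, 'b': 2}, ['x', 'y']),
    Foo({'b': 2, 'a': 1}, ['x', 'y']),  # equal to the first
    Foo({'a': 1}, ['x']),
]

# Later duplicates overwrite earlier ones stored under the same frozen key.
dupe_free = list(dict((freeze(x), x) for x in dupe_list).values())
```

Because the objects themselves are never hashed, they can stay mutable; only the snapshot taken at dedup time needs to be immutable.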

Another approach would be to use only immutable objects in the first place! For example, you could use namedtuples. Here's an example stolen from the Python docs:

>>> from collections import namedtuple
>>> Point = namedtuple('Point', ['x', 'y'])
>>> p = Point(11, y=22)     # instantiate with positional or keyword arguments
>>> p[0] + p[1]             # indexable like the plain tuple (11, 22)
33
>>> x, y = p                # unpack like a regular tuple
>>> x, y
(11, 22)
>>> p.x + p.y               # fields also accessible by name
33
>>> p                       # readable __repr__ with a name=value style
Point(x=11, y=22)
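Applied to the question, a namedtuple version might look like the sketch below. FooRecord is a hypothetical name for an immutable stand-in for Foo; it works with set as long as every field value is itself hashable.

```python
from collections import namedtuple

# Hypothetical immutable replacement for the question's Foo class.
FooRecord = namedtuple('FooRecord', ['myid'])

records = [FooRecord(1), FooRecord(2), FooRecord(3), FooRecord(4), FooRecord(3)]

# Tuples compare and hash by value, so equal records collapse into one.
unique_records = set(records)
```

No custom __eq__ or __hash__ is needed here, since namedtuples inherit both from tuple.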

Answer 2

Have you tried using a set (or frozenset)? It's explicitly for holding a unique set of items.

You'll need to create an appropriate __hash__ method, though. set (and frozenset) use the __hash__ method to hash objects; __eq__ is only used on a collision, AFAIK. Accordingly, you'll want to use a hash like hash(frozenset(self.__dict__.items())).
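As a rough sketch, plugging that hash into the question's class could look like this. It assumes every attribute value is hashable and that instances are not mutated while they sit in a set.

```python
class Foo(object):
    def __init__(self, myid):
        self.myid = myid

    def __eq__(self, other):
        return isinstance(other, self.__class__) and self.__dict__ == other.__dict__

    def __hash__(self):
        # Order-independent hash over all (name, value) attribute pairs.
        # Assumes every attribute value is itself hashable.
        return hash(frozenset(self.__dict__.items()))


f_list = [Foo(1), Foo(2), Foo(3), Foo(4), Foo(3)]
new_foo_list = list(set(f_list))
```

Note that the resulting list is in arbitrary order, as with any conversion through a set.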

Answer 3

According to the documentation, you need to define __hash__() and __eq__() for your custom class to work correctly with a set or frozenset, as both are implemented using hash tables in CPython.

If you implement __hash__, keep in mind that if a == b, then hash(a) must equal hash(b). Rather than comparing the whole __dict__s, I suggest the following more straightforward implementation for your simple class:

class Foo(object):
    def __init__(self, myid):
        self.myid = myid

    def __eq__(self, other):
        return isinstance(other, self.__class__) and other.myid == self.myid

    def __hash__(self):
        return hash(self.myid)

If your object contains mutable attributes, you simply shouldn't put it inside a set or use it as a dictionary key.
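To illustrate why, here is a small sketch (reusing the class above) of what can go wrong in CPython when an attribute that feeds __hash__ changes after the object has been added to a set. The variable names are just for illustration.

```python
class Foo(object):
    def __init__(self, myid):
        self.myid = myid

    def __eq__(self, other):
        return isinstance(other, self.__class__) and other.myid == self.myid

    def __hash__(self):
        return hash(self.myid)


f = Foo(1)
seen = set([f])
found_before = f in seen   # True: looked up under hash(1)

f.myid = 2                 # mutate after insertion; hash(f) is now hash(2)
found_after = f in seen    # False: the set still files f under the old hash
still_stored = len(seen)   # 1: the object is in the set but can't be found
```

The object is still stored, but lookups go by the new hash and miss it, so duplicates can slip in unnoticed.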

Answer 4

Here is an alternative method: just make a dictionary keyed by __dict__.items() for the instances:

f_list = [f1,f2,f3,f4,f5]
f_dict = dict([(tuple(i.__dict__.items()), i) for i in f_list])
print f_dict
print f_dict.values()
#output:
{(('myid', 1),): <__main__.Foo object at 0xb75e190c>, 
 (('myid', 2),): <__main__.Foo object at 0xb75e184c>, 
 (('myid', 3),): <__main__.Foo object at 0xb75e1f6c>, 
 (('myid', 4),): <__main__.Foo object at 0xb75e1cec>}
[<__main__.Foo object at 0xb75e190c>, 
 <__main__.Foo object at 0xb75e184c>, 
 <__main__.Foo object at 0xb75e1f6c>, 
 <__main__.Foo object at 0xb75e1cec>]

This way you just let the dictionary take care of the uniqueness based on attributes, and can easily retrieve the objects by getting the values.
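One caveat with the tuple key is that it depends on the order in which __dict__ yields its items, which is not guaranteed to match between two otherwise equal objects. A possible variation, sketched below with the hypothetical helper names attribute_key and dedupe_by_attributes, uses a frozenset of the items instead so the key ignores order. It still assumes all attribute values are hashable.

```python
class Foo(object):
    def __init__(self, myid):
        self.myid = myid


def attribute_key(obj):
    """Order-independent, hashable key built from an object's attributes."""
    return frozenset(vars(obj).items())


def dedupe_by_attributes(items):
    """Keep one object per distinct set of attribute values."""
    return list(dict((attribute_key(i), i) for i in items).values())


f_list = [Foo(1), Foo(2), Foo(3), Foo(4), Foo(3)]
unique = dedupe_by_attributes(f_list)
```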

Answer 5

If you are allowed, you can use a set: http://docs.python.org/library/sets.html

>>> items = [1,2,3,3,45,4,45,6]   # renamed from `list` to avoid shadowing the built-in
>>> print set(items)
set([1, 2, 3, 4, 6, 45])
>>> x = set(items)
>>> print x
set([1, 2, 3, 4, 6, 45])

5 answers

solution1
8 ACCEPTED 2012-05-10 14:08:27

solution2
3 2012-05-10 13:55:55

solution3
3 2012-05-10 14:01:07

solution4
1 2012-05-10 14:22:24

solution5
-1 2012-05-10 13:57:47
