Dictionary with custom tuples as keys, to circumvent a Pandas "bug", doesn't work as expected

I'm trying to create a class which acts like a dictionary whose keys are tuples, but I don't want them to be "truly" tuples, because I'll use this dictionary to create Pandas dataframes, and Pandas assumes that tuples as keys mean a multi-index (which is not correct in this case).

With single-element tuples, this produces a bug, for example:

>>> import pandas as pd
>>> a = {(1,): 1 }
>>> pd.Series(a)
1      NaN
dtype: float64

What happens is that Pandas sees that the key of the dictionary is a tuple, so it assumes a multi-index. Then it sees that the length of the tuple is 1, so it decides to create a plain index after all. But it fails to store the value, because the dictionary does not have the key 1, only the key (1,), hence the NaN.

Leaving this bug aside, Pandas works fine with "normal" tuples of several elements, but it assumes a multi-level index, which I don't want:

>>> a = {(1,2): 1 }
>>> pd.Series(a)
1  2    1
dtype: int64

What I want instead is to use the tuple (1,2) itself as the index.

I decided to implement my own Tuple class, like this (imitating the implementation of UserList in the standard library's collections module, but keeping it to a minimum):

from collections.abc import Sequence
class Tuple(Sequence):
    def __init__(self, initlist=None):
        self.data = ()
        if initlist is not None:
            if type(initlist) == type(self.data):
                self.data = initlist
            elif isinstance(initlist, Tuple):
                self.data = initlist.data
            else:
                self.data = tuple(initlist)
    def __getitem__(self, i): return self.data[i]
    def __len__(self): return len(self.data)
    def __hash__(self): return hash(self.data)
    def __repr__(self): return repr(self.data)

Sequence.register(Tuple)

If I use this kind of object as keys in my dictionary, Pandas is forced to use the object as the index, which stops it from generating a multi-index:

>>> a = {Tuple((1,2)): 1}
>>> pd.Series(a)
(1, 2)    1
dtype: int64

The dictionary looks as if the keys were tuples:

>>> a
{(1, 2): 1}

So far, so good. However, something strange happens:

>>> a[Tuple((1,2))]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-169-9641d6999f03> in <module>()
----> 1 a[Tuple((1,2))]

KeyError: (1, 2)

Why is this? As far as I understand, Python dictionaries should locate the value by computing the hash of the given key, which my Tuple.__hash__() does consistently, by hashing its inner data. Then why is the key not found?

I guess that I must implement some other method in my Tuple class, but I cannot see which one, or why.

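You also need to implement __eq__ (or __cmp__ in Python 2) for the class to work as a dictionary key. A dictionary uses the hash only to find candidate entries and then compares keys for equality. Without __eq__, your class falls back to the default comparison by identity, so a newly created Tuple((1,2)) has the same hash as the stored key but does not compare equal to it. The Python 2 glossary defines hashable as follows:

An object is hashable if it has a hash value which never changes during its lifetime (it needs a __hash__() method), and can be compared to other objects (it needs an __eq__() or __cmp__() method). Hashable objects which compare equal must have the same hash value.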
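
As a minimal sketch of what that could look like for the class in the question (Python 3 assumed; the simplified __init__ and the choice to treat only other Tuple instances as equal are assumptions of this sketch, not something the question specifies):

```python
from collections.abc import Sequence


class Tuple(Sequence):
    """Tuple-like key that is deliberately not a subclass of tuple."""

    def __init__(self, initlist=None):
        if initlist is None:
            self.data = ()
        elif isinstance(initlist, Tuple):
            self.data = initlist.data
        else:
            self.data = tuple(initlist)

    def __getitem__(self, i):
        return self.data[i]

    def __len__(self):
        return len(self.data)

    def __hash__(self):
        # Must agree with __eq__: equal objects need equal hashes.
        return hash(self.data)

    def __eq__(self, other):
        # Assumption: only another Tuple can be equal, never a plain tuple.
        if isinstance(other, Tuple):
            return self.data == other.data
        return NotImplemented

    def __repr__(self):
        return repr(self.data)


a = {Tuple((1, 2)): 1}
print(a[Tuple((1, 2))])  # prints 1: the lookup now succeeds
print((1, 2) in a)       # prints False: a plain tuple is a different key
```

Returning NotImplemented for other types keeps a plain (1, 2) from matching a Tuple key even though both hash to the same value. If plain tuples should match as well, __eq__ could compare self.data with the tuple directly, which is a design choice rather than a requirement.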
