
Why does the id of a pandas DataFrame cell change with each execution?

I ran into this problem while trying to verify some properties of a DataFrame view.

Suppose I have a dataframe defined as df = pd.DataFrame(columns=list('abc'), data=np.arange(18).reshape(6, 3)) and a view of this dataframe defined as df1 = df.iloc[:3, :]. We now have two dataframes as follows:

print(df)
    a   b   c
0   0   1   2
1   3   4   5
2   6   7   8
3   9  10  11
4  12  13  14
5  15  16  17

print(df1)

   a  b  c
0  0  1  2
1  3  4  5
2  6  7  8

Now I want to output the id of a particular cell of these two dataframes:

print(id(df.loc[0, 'a']))
print(id(df1.loc[0, 'a']))

and I have the output as:

140114943491408
140114943491408

The weird thing is that if I run those two print lines again, the ids change:

140114943491480
140114943491480

I should emphasize that I did not re-run the code that defines df when I ran the two print lines, so df and df1 were not redefined. In my understanding, the memory address of each element in the data frame should then be fixed, so how can the output change?

Something even stranger happens when I keep running those two print lines. In rare cases, the two ids are not even equal to each other:

140114943181088
140114943181112

But if I execute id(df.loc[0, 'a']) == id(df1.loc[0, 'a']) at the same time, Python still outputs True. I know that since df1 is a view of df, their cells should share the same memory, so how can their ids occasionally differ?

These strange behaviors leave me completely lost. Could anyone explain them? Are they due to the characteristics of the data frame or of the id function in Python? Thanks!

FYI, I am using Python 3.5.2.

You are not getting the id of a "cell", you are getting the id of the object returned by the .loc accessor, which is a boxed version of the underlying data.

So,

>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame(columns=list('abc'), data=np.arange(18).reshape(6, 3))
>>> df1 = df.iloc[:3, :]
>>> df.dtypes
a    int64
b    int64
c    int64
dtype: object
>>> df1.dtypes
a    int64
b    int64
c    int64
dtype: object

But since everything in Python is an object, the .loc accessor must return an object:

>>> x = df.loc[0, 'a']
>>> x
0
>>> type(x)
<class 'numpy.int64'>
>>> isinstance(x, object)
True

However, the actual underlying buffer is a primitive array of fixed-size C 64-bit signed integers. These are not Python objects. Each time you access one, it gets "boxed" into a new Python object, to borrow a term from other languages that mix primitive types with objects.
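As a rough analogue that needs only the standard library (an illustration of boxing in general, not a description of pandas internals), the array module also stores raw C 64-bit integers and has to build a Python int each time you index it. The values are deliberately chosen to be large so that CPython's small-int cache does not get in the way.

```python
from array import array

# 'q' is a C signed long long: raw fixed-size integers, not Python objects.
buf = array('q', [1_000_000 + i for i in range(18)])
print(buf.itemsize)   # bytes per element in the raw buffer

# Indexing "boxes" the raw value into a Python int on each access.
x = buf[0]
y = buf[0]
print(x == y)         # the two boxed objects hold the same value
print(x is y)         # identity is implementation-dependent; typically False in CPython
```

The point is that the "cell" itself has no id. Only the temporary objects created on access do.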

Now, consider the phenomenon you are seeing, where all the objects have the same id:

>>> id(df.loc[0, 'a']), id(df.loc[0, 'a'])
(4539673432, 4539673432)
>>> id(df.loc[0, 'a']), id(df.loc[0, 'a']), id(df1.loc[0,'a'])
(4539673432, 4539673432, 4539673432)

This occurs because in Python, a new object is free to re-use the memory address of a recently reclaimed object. Indeed, when you create your tuple of ids, each object returned by loc only exists long enough to be passed to and processed by one invocation of id. By the time you call loc again, the first object has already been deallocated, and the new one simply re-uses the same memory. You can see the same behavior with any Python object, like a list:

>>> id([]), id([])
(4545276872, 4545276872)

Fundamentally, ids are only guaranteed to be unique for the lifetime of the object. Read more about this phenomenon here. But note that in the following case, the ids will always be different:

>>> x = df.loc[0, 'a']
>>> x2 = df.loc[0, 'a']
>>> id(x), id(x2)
(4539673432, 4539673408)

Since you keep references to both objects, neither is reclaimed, so the second one requires new memory.
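The same contrast can be sketched without pandas. In the example below, Scalar and fetch are hypothetical stand-ins for the temporary object that loc returns. Only the second half is guaranteed by the language; how often the temporaries repeat an id depends on the interpreter's allocator.

```python
class Scalar:
    """Hypothetical stand-in for the temporary object returned by .loc."""
    def __init__(self, value):
        self.value = value

def fetch():
    # Builds a brand-new object on every call, like a boxed scalar.
    return Scalar(0)

# Temporaries: each object is unreferenced as soon as id() returns,
# so its memory may be handed to the next one and the ids may repeat.
temp_ids = [id(fetch()) for _ in range(5)]
print(len(set(temp_ids)), "distinct ids among 5 temporaries")

# Kept alive: five objects that exist at the same time must have five different ids.
kept = [fetch() for _ in range(5)]
kept_ids = [id(obj) for obj in kept]
print(len(set(kept_ids)), "distinct ids among 5 live objects")
```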

Note that for many immutable objects, the interpreter is free to optimize and return the exact same object. In CPython, this is the case with "small ints", the so-called small-int cache:

>>> x = 2
>>> y = 2
>>> id(x), id(y)
(4304820368, 4304820368)

But this is an implementation detail that should not be relied upon.
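If the goal is only to check that two cells hold the same value, comparing by value avoids the question of identity altogether. A minimal sketch, using plain Python ints built at run time as stand-ins for the boxed scalars:

```python
# int() is called at run time, so the compiler cannot merge the two
# results into a single shared constant.
a = int("100000")
b = int("100000")

print(a == b)   # value comparison: True, regardless of how objects are cached
print(a is b)   # identity comparison: depends on the interpreter
```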

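If you want to prove to yourself that your data frames share the same underlying buffer, just mutate one of them and you'll see the same change reflected in the other. (This is the behavior of the pandas version used here. With Copy-on-Write enabled in newer pandas versions, a slice no longer reflects later writes to its parent.)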

>>> df
    a   b   c
0   0   1   2
1   3   4   5
2   6   7   8
3   9  10  11
4  12  13  14
5  15  16  17
>>> df1
   a  b  c
0  0  1  2
1  3  4  5
2  6  7  8
>>> df.loc[0, 'a'] = 99
>>> df
    a   b   c
0  99   1   2
1   3   4   5
2   6   7   8
3   9  10  11
4  12  13  14
5  15  16  17
>>> df1
    a  b  c
0  99  1  2
1   3  4  5
2   6  7  8
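A non-destructive way to ask the same question is NumPy's shares_memory function. The sketch below works on a plain ndarray with the same layout as the frame, because whether a DataFrame hands out a view of its buffer (for example through to_numpy()) depends on the dtypes and the pandas version. Treat applying it to DataFrames as an assumption to verify on your own setup.

```python
import numpy as np

# Same layout as the DataFrame above, as a plain NumPy array.
data = np.arange(18, dtype=np.int64).reshape(6, 3)
first_rows = data[:3, :]        # basic slicing returns a view
copied = data[:3, :].copy()     # an explicit copy owns its own buffer

print(np.shares_memory(data, first_rows))
print(np.shares_memory(data, copied))

# Writing through the parent is visible in the view but not in the copy.
data[0, 0] = 99
print(first_rows[0, 0], copied[0, 0])
```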
