
Pandas: Relationship between a data frame and the numpy.array used to define it

I just wanted to create two data frames of the same dimensions which were initially empty. I did it this way:

import numpy as np
import pandas as pd

m = np.empty((2, 3))*np.nan
df1 = pd.DataFrame(m)
df2 = pd.DataFrame(m)

But when I change a particular value in one data frame, all three objects are affected:

df2.iloc[1, 2] = 1

print(df2)
    0   1    2
0 NaN NaN  NaN
1 NaN NaN  1.0

print(df1)
    0   1    2
0 NaN NaN  NaN
1 NaN NaN  1.0

print(m)
array([[nan, nan, nan],
       [nan, nan,  1.]])

So it seems that a data frame is just a wrapper around a numpy array: no copy is made. I have not seen this behavior documented anywhere and I just wanted to point it out. Any comments?

There is an init arg to DataFrame that lets you specify whether to copy the data from the ndarray into the DataFrame.

See the pandas source code, frame.py, line 405 and later (the exact line number depends on the pandas version). By default, copy is False.

So, you can force copying with something like:

import numpy as np
import pandas as pd

m = np.empty((2, 3))*np.nan
df1 = pd.DataFrame(m, copy=True)  # df1 gets its own copy of the data
df2 = pd.DataFrame(m)             # df2 is built without requesting a copy

df2.iloc[1, 2] = 1
print(df1)
print(df2)
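To check this rather than infer it from printed output, here is a minimal sketch that mutates the source array and then inspects both frames. It assumes np.shares_memory and DataFrame.to_numpy() are an acceptable way to probe for a shared buffer; the names df_copied and df_default are only illustrative. Whether the default constructor shares memory with m can depend on the pandas version and its copy-on-write settings, so that result is printed rather than assumed.

```python
import numpy as np
import pandas as pd

m = np.full((2, 3), np.nan)

# copy=True asks the constructor for its own copy of the data.
df_copied = pd.DataFrame(m, copy=True)
# Default constructor: sharing with m may vary across pandas versions.
df_default = pd.DataFrame(m)

# Mutate the source array directly.
m[1, 2] = 1.0

# The explicitly copied frame should still hold NaN at that position.
copied_is_independent = bool(np.isnan(df_copied.iloc[1, 2]))
# Probe whether the default frame's data overlaps the array's buffer.
default_shares_memory = bool(np.shares_memory(m, df_default.to_numpy()))

print("copy=True frame unaffected:", copied_is_independent)
print("default frame shares memory with m:", default_shares_memory)
```

If the second line prints True, any write to m (or to the frame) is visible through both objects, which is the behavior described in the question.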

I think this happens because df1 and df2 both point to the same memory: each one wraps the array m without copying it. If you're not familiar with pointers, see for example this.
A quick way to solve the problem is to copy the shared numpy array into a new array:

import numpy as np
import pandas as pd

m = np.empty((2, 3))*np.nan
n = m.copy()
df1 = pd.DataFrame(m)
df2 = pd.DataFrame(n)

df2.iloc[1, 2] = 1

print(df1)
print(df2)
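The same idea can be wrapped in a small helper when several empty frames are needed. This is only a sketch: empty_frames is a hypothetical name, not a pandas function, and it assumes that allocating a fresh array per frame with np.full is acceptable for your sizes.

```python
import numpy as np
import pandas as pd


def empty_frames(n_frames, shape):
    """Return n_frames independent NaN-filled DataFrames (hypothetical helper)."""
    # np.full allocates a new array on every call, so no buffer is shared.
    return [pd.DataFrame(np.full(shape, np.nan)) for _ in range(n_frames)]


df1, df2 = empty_frames(2, (2, 3))
df2.iloc[1, 2] = 1

print(df1)  # still all NaN
print(df2)  # only df2 holds the 1.0
```

Because each frame is built from its own array, there is no shared m left to modify by accident.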

The idea behind this behavior is that numpy and pandas are designed for efficiency. So the developers' philosophy is: content is copied only when necessary.

For example:

import numpy as np
import pandas as pd

a = np.ones((2, 3))
df = pd.DataFrame(a)
df.iloc[0, 0] = "string"  # a string cannot be stored in a float column

In [2]: a
Out[2]: 
array([[ 1.,  1.,  1.],
       [ 1.,  1.,  1.]])

In [3]: df
Out[3]: 
        0    1    2
0  string  1.0  1.0
1       1  1.0  1.0

In this case a copy is made, since the dtype of the column has to change.
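The "copy only when necessary" idea can also be seen in NumPy alone, without relying on how a particular pandas version handles the assignment above. The sketch below uses np.asarray, which returns its input unchanged when no conversion is needed and allocates a new array when a different dtype is requested; the variable names are illustrative.

```python
import numpy as np

a = np.ones((2, 3))  # float64 array

# Already a float64 ndarray: np.asarray hands back the same object.
same = np.asarray(a)
# A different dtype is requested, so a new buffer has to be allocated.
converted = np.asarray(a, dtype=np.int64)

same[0, 0] = 5.0     # writes through to a
converted[0, 1] = 7  # does not touch a

print(same is a)
print(np.shares_memory(a, converted))
print(a)
```

The write through same shows up in a, while the write to converted does not, because the dtype change forced a copy.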
