Pandas: the relationship between a DataFrame and the numpy.array used to define it

Question

I just wanted to create two data frames of the same dimensions which were initially empty. I did it this way:

import numpy as np
import pandas as pd

m = np.empty((2, 3))*np.nan
df1 = pd.DataFrame(m)
df2 = pd.DataFrame(m)

But when I changed a particular value in one data frame, all three objects were affected:

df2.iloc[1, 2] = 1

print(df2)
    0   1    2
0 NaN NaN  NaN
1 NaN NaN  1.0

print(df1)
    0   1    2
0 NaN NaN  NaN
1 NaN NaN  1.0

print(m)
array([[nan, nan, nan],
       [nan, nan,  1.]])

So it seems that a data frame is just a wrapper around a numpy array: no copy is made. I have not seen this behavior documented anywhere and I just wanted to point it out. Any comments?

Answer 1

There is an init arg to DataFrame, copy, that lets you specify whether to copy the data from the ndarray into the DataFrame.

See the pandas source code, frame.py, line 405 and later. By default, copy is False.

So, you can force copying with something like:

import numpy as np
import pandas as pd

m = np.empty((2, 3))*np.nan
df1 = pd.DataFrame(m, copy=True)
df2 = pd.DataFrame(m)

df2.iloc[1, 2] = 1
print(df1)
print(df2)
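
One way to check whether a frame and its source array really share a buffer is np.shares_memory. The sketch below is only illustrative: the helper name shares_buffer is made up here, and it assumes a single-dtype frame, for which to_numpy() normally returns a view rather than a copy. The result for the default constructor depends on the pandas version (as far as I know, newer releases with Copy-on-Write copy the array by default), so no output is shown for it.

```python
import numpy as np
import pandas as pd


def shares_buffer(df, arr):
    """Return True if df's values overlap arr in memory (hypothetical helper)."""
    # For a frame with one dtype, to_numpy() normally returns a view,
    # so np.shares_memory can compare it against the source array.
    return bool(np.shares_memory(df.to_numpy(), arr))


m = np.full((2, 3), np.nan)

df_default = pd.DataFrame(m)            # may or may not copy, depending on version
df_copied = pd.DataFrame(m, copy=True)  # explicitly asks for a copy

print("default shares memory:", shares_buffer(df_default, m))
print("copy=True shares memory:", shares_buffer(df_copied, m))

# Writing to the copied frame leaves the source array untouched.
df_copied.iloc[1, 2] = 1
print(np.isnan(m).all())
```

If the first line prints True on your installation, the frame is a view on m and the behavior from the question applies.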

Answer 2

I think this happens because df1 and df2 both point to the same memory. If you're not familiar with pointers, see for example this.
A quick way to solve the problem is to copy the shared numpy array into a new array:

import numpy as np
import pandas as pd

m = np.empty((2, 3))*np.nan
n = m.copy()
df1 = pd.DataFrame(m)
df2 = pd.DataFrame(n)

df2.iloc[1, 2] = 1

print(df1)
print(df2)
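
If the goal is simply two independent NaN-filled frames, another option is to build a fresh array for each frame instead of copying a shared one. A minimal sketch, where nan_frame is a hypothetical helper name and not part of pandas:

```python
import numpy as np
import pandas as pd


def nan_frame(n_rows, n_cols):
    """Build a NaN-filled DataFrame backed by its own new array."""
    # np.full allocates a new array on every call, so frames never share data.
    return pd.DataFrame(np.full((n_rows, n_cols), np.nan))


df1 = nan_frame(2, 3)
df2 = nan_frame(2, 3)

df2.iloc[1, 2] = 1

print(df1)  # still all NaN
print(df2)  # only the cell at row 1, column 2 is 1.0
```

np.full((2, 3), np.nan) also states the intent more directly than np.empty((2, 3))*np.nan.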

Answer 3

The idea behind this behavior is that numpy and pandas are designed for efficiency. So the developers' philosophy is: content is copied only when necessary.

For example:

import numpy as np
import pandas as pd

a = np.ones((2, 3))
df = pd.DataFrame(a)
df.iloc[0, 0] = "string"  # a string cannot be stored in a float64 column

In [2]: a
Out[2]: 
array([[ 1.,  1.,  1.],
       [ 1.,  1.,  1.]])

In [3]: df
Out[3]: 
        0    1    2
0  string  1.0  1.0
1       1  1.0  1.0

In this case a copy is made, since the dtype of the column changes: a is left untouched while df holds the new value.
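
The same effect can be seen without mixing a string into a float column, which recent pandas versions may warn about or reject. The sketch below uses astype to force a dtype change and np.shares_memory to confirm that the converted frame no longer overlaps the source array; it assumes a single-dtype frame, so that to_numpy() returns a view of the frame's data.

```python
import numpy as np
import pandas as pd

a = np.ones((2, 3))
df = pd.DataFrame(a)

# Converting float64 to int64 needs a new buffer, so a copy is unavoidable.
df_int = df.astype("int64")
df_int.iloc[0, 0] = 7

print(np.shares_memory(df_int.to_numpy(), a))  # False
print(a)  # unchanged, still all ones
```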

Answer 1
Score 5, accepted, 2018-10-22 12:51:02

Answer 2
Score 3, 2018-10-22 12:21:59

Answer 3
Score 2, 2018-10-22 13:18:09