How to concatenate pandas DataFrames without copying data?

Question

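I want to concatenate two pandas DataFrames without copying the data. That is, I want the concatenated DataFrame to be a view on the data in the two original DataFrames. I tried concat(), but it didn't work. The code below shows that changing the underlying data affects the DataFrames that were concatenated, but not the concatenated DataFrame: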

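import numpy as np
import pandas as pd

# `dates` and `pp` are not defined in the post; these two definitions are
# assumptions chosen to match the output shown below.
dates = pd.date_range('2013-01-01', periods=6)
pp = print

arr = np.random.randn(12).reshape(6, 2)
df = pd.DataFrame(arr, columns = ('VALE5', 'PETR4'), index = dates)
arr2 = np.random.randn(12).reshape(6, 2)
# df2 is built from arr (not arr2), which is why A and B hold the same values below
df2 = pd.DataFrame(arr, columns = ('AMBV3', 'BBDC4'), index = dates)
df_concat = pd.concat(dict(A = df, B = df2),axis=1)
pp(df)
pp(df_concat)
arr[0, 0] = 9999999.99
pp(df)
pp(df_concat)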

Here is the output of the last five lines. After the new value is assigned to arr[0, 0], df changes; df_concat is unaffected.

In [56]: pp(df)
           VALE5     PETR4
2013-01-01 -0.557180  0.170073
2013-01-02 -0.975797  0.763136
2013-01-03 -0.913254  1.042521
2013-01-04 -1.973013 -2.069460
2013-01-05 -1.259005  1.448442
2013-01-06 -0.323640  0.024857

In [57]: pp(df_concat)
               A                   B          
           VALE5     PETR4     AMBV3     BBDC4
2013-01-01 -0.557180  0.170073 -0.557180  0.170073
2013-01-02 -0.975797  0.763136 -0.975797  0.763136
2013-01-03 -0.913254  1.042521 -0.913254  1.042521
2013-01-04 -1.973013 -2.069460 -1.973013 -2.069460
2013-01-05 -1.259005  1.448442 -1.259005  1.448442
2013-01-06 -0.323640  0.024857 -0.323640  0.024857

In [58]: arr[0, 0] = 9999999.99

In [59]: pp(df)
                 VALE5     PETR4
2013-01-01  9999999.990000  0.170073
2013-01-02       -0.975797  0.763136
2013-01-03       -0.913254  1.042521
2013-01-04       -1.973013 -2.069460
2013-01-05       -1.259005  1.448442
2013-01-06       -0.323640  0.024857

In [60]: pp(df_concat)
               A                   B          
           VALE5     PETR4     AMBV3     BBDC4
2013-01-01 -0.557180  0.170073 -0.557180  0.170073
2013-01-02 -0.975797  0.763136 -0.975797  0.763136
2013-01-03 -0.913254  1.042521 -0.913254  1.042521
2013-01-04 -1.973013 -2.069460 -1.973013 -2.069460
2013-01-05 -1.259005  1.448442 -1.259005  1.448442
2013-01-06 -0.323640  0.024857 -0.323640  0.024857

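I guess this means concat() created a copy of the data. Is there a way to avoid making a copy? (I want to minimize memory usage.)

Also, is there a quick way to check whether two DataFrames are linked to the same underlying data, without going through the hassle of changing the data and checking whether each DataFrame has changed?

Thanks for the help.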

FS

Answer 1

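You can't (at least not easily). When you call concat, np.concatenate is eventually called.

See this answer for an explanation of why arrays can't be concatenated without copying. The short version is that the arrays are not guaranteed to be contiguous in memory.

Here's a simple example: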

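from numpy import vstack
from numpy.random import rand

a = rand(2, 10)
x, y = a              # x and y are views on the two rows of a
z = vstack((x, y))    # vstack allocates a new array for the result
print('x.base is a and y.base is a ==', x.base is a and y.base is a)
print('x.base is z or y.base is z ==', x.base is z or y.base is z)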

Output:

x.base is a and y.base is a == True
x.base is z or y.base is z == False

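Even though x and y share the same base (namely a), concatenate (and therefore vstack) can't assume that they do, because people often want to concatenate arrays with arbitrary strides.

You can easily generate two arrays with different strides that share the same memory, like this: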

from numpy import arange

a = arange(10)
b = a[::2]    # every second element, backed by the same memory as a
print(a.strides)
print(b.strides)

Output:

(8,)
(16,)

This is why the following happens:

In [214]: a = arange(10)

In [215]: a[::2].view(int16)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-215-0366fadb1128> in <module>()
----> 1 a[::2].view(int16)

ValueError: new type not compatible with array.

In [216]: a[::2].copy().view(int16)
Out[216]: array([0, 0, 0, 0, 2, 0, 0, 0, 4, 0, 0, 0, 6, 0, 0, 0, 8, 0, 0, 0], dtype=int16)
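
To make the link between strides and the failed view explicit, here is a minimal sketch (not part of the original session) that inspects the contiguity flags directly. It assumes a 64-bit integer dtype, set explicitly so that the strides match the output above on any platform; the names view_failed and reinterpreted are only illustrative.

```python
import numpy as np

a = np.arange(10, dtype=np.int64)  # explicit dtype: 8 bytes per element
b = a[::2]                         # strided view, shares a's buffer
c = b.copy()                       # copy into a fresh, densely packed buffer

# Strides and contiguity of the view versus the copy
print(b.strides, b.flags.c_contiguous)
print(c.strides, c.flags.c_contiguous)

# The view overlaps a's memory; the copy does not
print(np.shares_memory(a, b), np.shares_memory(a, c))

# Reinterpreting the strided view as int16 is rejected with a ValueError
try:
    b.view(np.int16)
    view_failed = False
except ValueError:
    view_failed = True
print(view_failed)

# The contiguous copy can be reinterpreted: 5 int64 values become 20 int16 values
reinterpreted = c.view(np.int16)
print(reinterpreted.shape)
```

The copy is what makes the reinterpretation possible, and that is the same reason concatenate has to copy: it needs one contiguous result buffer, which arbitrary inputs cannot provide.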

Edit: Using pd.merge(df1, df2, copy=False) (or df1.merge(df2, copy=False)) when df1.dtype != df2.dtype will not make a copy. Otherwise, a copy is made.
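
As for the second part of the question (a quick check for shared data), one possible approach is np.shares_memory. The sketch below is an assumption-laden illustration rather than an official pandas facility: shares_data and _buffer are hypothetical helper names, and the check is only meaningful for single-dtype DataFrames, because to_numpy() has to build a new array when the columns have mixed dtypes. Results can also differ between pandas versions, since copying behaviour has changed over time.

```python
import numpy as np
import pandas as pd


def _buffer(obj):
    # Hypothetical helper: use to_numpy() for pandas objects, asarray otherwise
    return obj.to_numpy() if hasattr(obj, "to_numpy") else np.asarray(obj)


def shares_data(left, right):
    """Return True if the NumPy buffers behind the two objects overlap.

    Only reliable for single-dtype DataFrames; mixed-dtype frames are
    converted to a new array first, so the result is False for them.
    """
    return np.shares_memory(_buffer(left), _buffer(right))


arr = np.random.randn(6, 2)
# copy=False asks pandas to wrap arr instead of copying it
df = pd.DataFrame(arr, columns=["VALE5", "PETR4"], copy=False)
df_copy = df.copy(deep=True)

print(shares_data(df, arr))       # does df wrap arr's buffer?
print(shares_data(df, df_copy))   # a deep copy owns separate memory
print(shares_data(arr, arr[::2])) # plain NumPy view on arr
```

This avoids the mutate-and-compare test from the question, at the cost of only inspecting the array that to_numpy() hands back.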

Answer 1
2 2013-08-18 05:41:56
