Fastest way to copy columns from one DataFrame to another using pandas?
I have a large DataFrame (million+ records) I'm using to store the core of my data (like a database), and a smaller DataFrame (1 to 2000 records) from which I'm combining a few of the columns at each time step of my program, which can run for several thousand time steps. Both DataFrames are indexed the same way by an id column.
The code I'm using is:
df_large.loc[new_ids, core_cols] = df_small.loc[new_ids, core_cols]
where core_cols is a list of about 10 fields that I'm copying over and new_ids are the ids from the small DataFrame. This code works fine, but it is the slowest part of my code by about three orders of magnitude. I just wanted to know if there is a faster way to merge the data of the two DataFrames together.

I tried merging the data each time with the merge function, but the process took way too long; that is why I have gone to creating a larger DataFrame that I update, to improve the speed.
There is nothing inherently slow about using .loc to set with an alignable frame, though it does go through a bit of code to cover a lot of cases, so it's probably not ideal to have in a tight loop. FYI, this example is slightly different from the 2nd example.
In [1]: import numpy as np
In [2]: import pandas as pd
In [3]: from pandas import DataFrame
In [4]: df = DataFrame(1.,index=list('abcdefghij'),columns=[0,1,2])
In [5]: df
Out[5]:
0 1 2
a 1 1 1
b 1 1 1
c 1 1 1
d 1 1 1
e 1 1 1
f 1 1 1
g 1 1 1
h 1 1 1
i 1 1 1
j 1 1 1
[10 rows x 3 columns]
In [6]: df2 = DataFrame(0,index=list('afg'),columns=[1,2])
In [7]: df2
Out[7]:
1 2
a 0 0
f 0 0
g 0 0
[3 rows x 2 columns]
In [8]: df.loc[df2.index,df2.columns] = df2
In [9]: df
Out[9]:
0 1 2
a 1 0 0
b 1 1 1
c 1 1 1
d 1 1 1
e 1 1 1
f 1 0 0
g 1 0 0
h 1 1 1
i 1 1 1
j 1 1 1
[10 rows x 3 columns]
Here's an alternative. It may or may not fit your data pattern. If the updates (your small frame) are pretty much independent, this would work (IOW you are not updating the big frame, then picking out a new sub-frame, then updating, etc. - if that is your pattern, then using .loc is about right).

Instead of updating the big frame, update the small frame with the columns from the big frame, e.g.:
In [10]: df = DataFrame(1.,index=list('abcdefghij'),columns=[0,1,2])
In [11]: df2 = DataFrame(0,index=list('afg'),columns=[1,2])
In [12]: needed_columns = df.columns.difference(df2.columns)
In [13]: df2[needed_columns] = df.reindex(index=df2.index,columns=needed_columns)
In [14]: df2
Out[14]:
1 2 0
a 0 0 1
f 0 0 1
g 0 0 1
[3 rows x 3 columns]
In [15]: df3 = DataFrame(0,index=list('cji'),columns=[1,2])
In [16]: needed_columns = df.columns.difference(df3.columns)
In [17]: df3[needed_columns] = df.reindex(index=df3.index,columns=needed_columns)
In [18]: df3
Out[18]:
1 2 0
c 0 0 1
j 0 0 1
i 0 0 1
[3 rows x 3 columns]
And concat everything together when you want (they are kept in a list in the meantime; or see my comments below - these sub-frames could be moved to external storage when created, then read back before this concatenating step).
In [19]: pd.concat([ df.reindex(index=df.index.difference(df2.index).difference(df3.index)), df2, df3]).reindex_like(df)
Out[19]:
0 1 2
a 1 0 0
b 1 1 1
c 1 0 0
d 1 1 1
e 1 1 1
f 1 0 0
g 1 0 0
h 1 1 1
i 1 0 0
j 1 0 0
[10 rows x 3 columns]
The beauty of this pattern is that it is easily extended to using an actual db (or, much better, an HDFStore) to actually store the 'database', then creating/updating sub-frames as needed, and writing out to a new store when finished.
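As a sketch of that external-storage variant - using pickle files in a temporary directory as a stand-in for an HDFStore (the store API differs, but the write-sub-frames-then-read-back-and-concat pattern is the same); the file names here are illustrative, not the answer's actual code:

```python
import os
import tempfile

import pandas as pd
from pandas import DataFrame

df = DataFrame(1., index=list('abcdefghij'), columns=[0, 1, 2])
updates = [DataFrame(0, index=list('afg'), columns=[1, 2]),
           DataFrame(0, index=list('cji'), columns=[1, 2])]

with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for i, sub in enumerate(updates):
        # fill in the missing columns from the big frame, then persist
        needed = df.columns.difference(sub.columns)
        sub[needed] = df.reindex(index=sub.index, columns=needed)
        path = os.path.join(tmp, 'sub_%d.pkl' % i)
        sub.to_pickle(path)
        paths.append(path)

    # the "reduce" step: read the sub-frames back and concat with
    # the untouched rows of the big frame
    subs = [pd.read_pickle(p) for p in paths]
    untouched = df.index
    for sub in subs:
        untouched = untouched.difference(sub.index)
    result = pd.concat([df.reindex(index=untouched)] + subs).reindex_like(df)
```

With an HDFStore you would call store.put/store.get on keys instead of to_pickle/read_pickle, but the flow is identical.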
I use this pattern all of the time, though with Panels actually. I am able to use multiple processes to perform my computations AND write each individual Panel to a separate file, as they are all completely independent. The only dependent part is the concat.

This is essentially a map-reduce pattern.
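The map-reduce shape can be sketched like this (my illustration, not the answer's code; a thread pool is used for brevity - a process pool works the same way, since the sub-frames share no state and only the final concat is serial):

```python
from concurrent.futures import ThreadPoolExecutor

import pandas as pd
from pandas import DataFrame

df = DataFrame(1., index=list('abcdefghij'), columns=[0, 1, 2])

def build_subframe(idx):
    # "map" step: each worker builds one sub-frame independently
    sub = DataFrame(0, index=list(idx), columns=[1, 2])
    needed = df.columns.difference(sub.columns)
    sub[needed] = df.reindex(index=sub.index, columns=needed)
    return sub

with ThreadPoolExecutor() as pool:
    subs = list(pool.map(build_subframe, ['afg', 'cji']))

# "reduce" step: the only dependent part is this concat
rest = df.index
for sub in subs:
    rest = rest.difference(sub.index)
result = pd.concat([df.reindex(index=rest)] + subs).reindex_like(df)
```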
I've had to copy between large dataframes a fair bit. I'm using dataframes with realtime market data, which may not be what pandas is designed for, but this is my experience.

On my PC, copying a single datapoint with .at takes 15 µs, with the df size making a negligible difference. .loc takes a minimum of 550 µs and increases as the df gets larger: 3100 µs to copy a single point from one 100000x2 df to another. .ix seems to be just barely faster than .loc.
For a single datapoint, .at is very fast and is not impacted by the size of the dataframe, but it cannot handle ranges, so loops are required, and as such the time scaling is linear. .loc and .ix, on the other hand, are (relatively) very slow for single datapoints, but they can handle ranges and scale better than linearly. However, unlike .at, they slow down significantly with dataframe size.
Therefore, when I'm frequently copying small ranges between large dataframes, I tend to use .at with a for loop, and otherwise I use .ix with a range:
for new_id in new_ids:
for core_col in core_cols:
df_large.at[new_id, core_col] = df_small.at[new_id, core_col]
Of course, to do it properly I'd go with Jeff's solution above, but it's nice to have options.
Caveats of .at: it doesn't work with ranges, and it doesn't work if the dtype is datetime (and maybe others).
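If you want to reproduce these measurements on your own machine, a rough benchmark along these lines should do (the sizes and the block being copied are illustrative; absolute timings will differ from the numbers above):

```python
import timeit

import numpy as np
import pandas as pd

big = pd.DataFrame(np.random.randn(100000, 2), columns=['a', 'b'])
small = pd.DataFrame(np.random.randn(10, 2), columns=['a', 'b'])

def with_at():
    # element-by-element copy: fast per element, linear in block size
    for i in small.index:
        for c in small.columns:
            big.at[i, c] = small.at[i, c]

def with_loc():
    # one aligned range assignment: higher fixed cost, scales better
    big.loc[small.index, small.columns] = small

# compare per-call times for copying the same 10x2 block
t_at = timeit.timeit(with_at, number=100) / 100
t_loc = timeit.timeit(with_loc, number=100) / 100
print('.at loop: %.0f us, .loc slice: %.0f us' % (t_at * 1e6, t_loc * 1e6))
```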
Quick: copy columns a and b from the old df into a new df.
df1 = df[['a', 'b']]
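One thing to be aware of with that snippet (a small, made-up example): selecting with a list of labels returns a new DataFrame, and an explicit .copy() makes the independence from the original unambiguous and avoids SettingWithCopyWarning on later writes.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})

# explicit copy: writes to df1 cannot affect df and raise no warning
df1 = df[['a', 'b']].copy()
df1['a'] = 0  # df is untouched
```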