Applying a function to two pandas DataFrames efficiently
I have two DatetimeIndexed DataFrames with identical indexes and column names, each with approximately 8.26 million rows and 44 columns. The DataFrames are joined, then groupby is applied over a 10-minute time interval, giving approximately 6,884 groups. The matching column pairs are then iterated over, and a single value is returned for each group and column pair.
The solution below works and takes 34 minutes on a Xeon E5-2697 v3, and all the DataFrames fit in memory. I reckon there should be a more efficient way of computing this with two DataFrames, perhaps using Dask, although it is not clear to me how to do the time-based groupby with a Dask DataFrame.
import math

import pandas as pd

def circular_mean(burst_veldirection, burst_velspeed):
    # Weighted circular mean: sum the speed-weighted unit vectors, then take the angle.
    x = y = 0.
    for angle, weight in zip(burst_veldirection.values, burst_velspeed.values):
        x += math.cos(math.radians(angle)) * weight
        y += math.sin(math.radians(angle)) * weight
    mean = math.degrees(math.atan2(y, x))
    if mean < 0:
        mean = 360 + mean
    return mean

def circ_mean(df):
    results = []
    for x in range(0, 45):
        results.append(circular_mean(df[str(x)], df[str(x) + 'velspeed']))
    return results

burst_veldirection_velspeed = burst_veldirection.join(burst_velspeed, rsuffix='velspeed')
result = burst_veldirection_velspeed.groupby(pd.TimeGrouper(freq='10Min')).apply(circ_mean)
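Note that `pd.TimeGrouper` was deprecated and later removed from pandas; in current versions the equivalent time-based grouping uses `pd.Grouper`. A minimal sketch on synthetic data:

```python
import pandas as pd

# pd.TimeGrouper is gone in modern pandas; pd.Grouper(freq=...) is the
# drop-in replacement for time-based grouping on a DatetimeIndex.
idx = pd.date_range('2017-01-01', periods=30, freq='min')
df = pd.DataFrame({'v': range(30)}, index=idx)

sizes = df.groupby(pd.Grouper(freq='10min')).size()
# three 10-minute bins of 10 rows each
```

(`df.resample('10min')` is equivalent here when grouping only on the index.)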
Example short HDF file containing the first 10,000 records, covering 23 minutes.
This doesn't get you away from groupby, but just shifting over to numpy functions from doing everything element-wise gets a roughly 8-fold speed boost for me.
import numpy as np

def circ_mean2(df):
    # Split the joined frame: first 45 columns are directions, last 45 are speeds.
    df1 = df.iloc[:, :45]
    df2 = df.iloc[:, 45:]
    # Vectorized weighted vector sum per column, then back to an angle.
    x = np.sum(np.cos(np.radians(df1.values)) * df2.values, axis=0)
    y = np.sum(np.sin(np.radians(df1.values)) * df2.values, axis=0)
    arctan = np.degrees(np.arctan2(y, x))
    return np.where(arctan > 0, arctan, arctan + 360).tolist()
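As a self-contained cross-check that the vectorized version reproduces the loop version, both are redefined here on synthetic data, with the split point generalised to half the columns instead of the hard-coded 45. (I also use `arctan >= 0` so that an exact 0° maps to 0 in both versions; the strict `>` would map it to 360.)

```python
import math

import numpy as np
import pandas as pd

def circular_mean(direction, speed):
    # Scalar reference implementation (the loop version from the question).
    x = y = 0.
    for angle, weight in zip(direction.values, speed.values):
        x += math.cos(math.radians(angle)) * weight
        y += math.sin(math.radians(angle)) * weight
    mean = math.degrees(math.atan2(y, x))
    return mean + 360 if mean < 0 else mean

def circ_mean2(df):
    # Vectorized version; split point generalised to half the columns.
    half = df.shape[1] // 2
    dirs = df.iloc[:, :half].values
    speeds = df.iloc[:, half:].values
    x = np.sum(np.cos(np.radians(dirs)) * speeds, axis=0)
    y = np.sum(np.sin(np.radians(dirs)) * speeds, axis=0)
    arctan = np.degrees(np.arctan2(y, x))
    return np.where(arctan >= 0, arctan, arctan + 360).tolist()

rng = np.random.default_rng(1)
df = pd.DataFrame(
    np.column_stack([rng.uniform(0, 360, (50, 2)), rng.uniform(0, 2, (50, 2))]),
    columns=['0', '1', '0velspeed', '1velspeed'])

vec = circ_mean2(df)
ref = [circular_mean(df['0'], df['0velspeed']),
       circular_mean(df['1'], df['1velspeed'])]
# vec and ref agree to floating-point precision
```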
Comparison on 100 rows (random data); the small differences in the trailing digits are floating-point precision effects:
burst_veldirection_velspeed.groupby(pd.TimeGrouper(freq='10Min')).apply(circ_mean)
Out[546]:
2017-01-01 00:00:00 [107.1417250368678, 256.8946560151866, 213.146...
2017-01-01 00:10:00 [26.33395947005812, 27.786466256197127, 94.898...
2017-01-01 00:20:00 [212.56183600787307, 284.77924347375733, 241.7...
2017-01-01 00:30:00 [302.1659401891579, 91.1768853178421, 194.9664...
2017-01-01 00:40:00 [90.29680187822757, 337.4345622590224, 302.219...
2017-01-01 00:50:00 [94.88722975883893, 319.5580499260627, 204.511...
2017-01-01 01:00:00 [133.4980653288851, 55.16669017531442, 20.7527...
2017-01-01 01:10:00 [356.67045637546113, 151.25258425458003, 200.1...
2017-01-01 01:20:00 [350.2489907863962, 33.284286840600046, 145.66...
2017-01-01 01:30:00 [135.74199444105565, 62.66259615135012, 257.80...
Freq: 10T, dtype: object
burst_veldirection_velspeed.groupby(pd.TimeGrouper(freq='10Min')).apply(circ_mean2)
Out[547]:
2017-01-01 00:00:00 [107.1417236328125, 256.8946533203125, 213.146...
2017-01-01 00:10:00 [26.333953857421875, 27.78646469116211, 94.898...
2017-01-01 00:20:00 [212.5618438720703, 284.77923583984375, 241.72...
2017-01-01 00:30:00 [302.16595458984375, 91.1768798828125, 194.966...
2017-01-01 00:40:00 [90.29680633544922, 337.4345703125, 302.219909...
2017-01-01 00:50:00 [94.88722229003906, 319.55804443359375, 204.51...
2017-01-01 01:00:00 [133.498046875, 55.166690826416016, 20.7527561...
2017-01-01 01:10:00 [356.6704406738281, 151.25257873535156, 200.13...
2017-01-01 01:20:00 [350.2489929199219, 33.2842903137207, 145.6609...
2017-01-01 01:30:00 [135.7419891357422, 62.66258239746094, 257.807...
Freq: 10T, dtype: object
%timeit burst_veldirection_velspeed.groupby(pd.TimeGrouper(freq='10Min')).apply(circ_mean)
10 loops, best of 3: 80.3 ms per loop
%timeit burst_veldirection_velspeed.groupby(pd.TimeGrouper(freq='10Min')).apply(circ_mean2)
10 loops, best of 3: 10.4 ms per loop
On 10,000 rows:
%timeit burst_veldirection_velspeed.groupby(pd.TimeGrouper(freq='10Min')).apply(circ_mean)
1 loop, best of 3: 6.65 s per loop
%timeit burst_veldirection_velspeed.groupby(pd.TimeGrouper(freq='10Min')).apply(circ_mean2)
1 loop, best of 3: 709 ms per loop
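To go further, the `groupby.apply` with a Python-level function can be avoided entirely: precompute the speed-weighted Cartesian components once over the whole frame, reduce them with `resample().sum()` (a fast built-in aggregation), and convert back to angles at the end. A sketch under assumed synthetic stand-ins for `burst_veldirection` / `burst_velspeed` (3 columns here instead of 45; names and sizes are placeholders):

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins for the two DataFrames (hypothetical data).
idx = pd.date_range('2017-01-01', periods=1200, freq='s')
rng = np.random.default_rng(0)
cols = ['0', '1', '2']
burst_veldirection = pd.DataFrame(rng.uniform(0, 360, (1200, 3)), index=idx, columns=cols)
burst_velspeed = pd.DataFrame(rng.uniform(0, 2, (1200, 3)), index=idx, columns=cols)

# Speed-weighted Cartesian components, computed once for the whole frame,
# then summed per 10-minute bin with the built-in resample aggregation.
rad = np.radians(burst_veldirection)
x = (np.cos(rad) * burst_velspeed.values).resample('10min').sum()
y = (np.sin(rad) * burst_velspeed.values).resample('10min').sum()

# Circular mean per 10-minute bin and column, mapped into [0, 360).
result = np.degrees(np.arctan2(y, x)) % 360
```

Because the circular mean decomposes into two plain sums, the same `cos`/`sin` precomputation also makes the problem embarrassingly parallel: a Dask DataFrame with a sorted DatetimeIndex supports `resample(...).sum()` directly, so this formulation sidesteps the time-based-groupby question from the original post.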