Speed up applying function to a list of pandas dataframes
I have some time series data where each data point is a pandas data frame, i.e. the data is a list of data frames. I also have a function foo which operates on each data point. The aim is to apply the function over the entire time series data and to do it efficiently.
I have vectorized the function foo so that it operates on the entire data frame at once, and achieved a speedup of around 32x.
The original code is as follows:

def bar(row, cols):
    return tuple([row[col] for col in cols])

def foo(df, cols):
    keys = set()
    for index, row in df.iterrows():
        key = bar(row, cols)
        keys.add(key)
    # do calculations on keys that returns a numeric output, result
    return result  # float64
The vectorized code is as follows:

def vect_bar(df, cols):
    df['key'] = df[cols].values.sum(axis=1)
    return df

def vect_foo(df, cols):
    df['key'] = ""
    df = vect_bar(df, cols)
    keys = df.key.unique()
    # do calculations on keys that returns a numeric output, result
    return result  # float64
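One caveat with concatenating the columns into a single string: distinct tuples can collapse into the same key, e.g. ("ab", "c") and ("a", "bc") both become "abc". Below is a minimal sketch of an alternative, assuming only the set of distinct keys is needed. The helper name unique_keys and the sample frame are hypothetical and not part of the original code.

```python
import pandas as pd

def unique_keys(df, cols):
    """Return the set of distinct (col1, col2, ...) tuples found in df."""
    # drop_duplicates compares the selected columns jointly, so
    # ("ab", "c") and ("a", "bc") remain two different keys.
    unique_rows = df[cols].drop_duplicates()
    # name=None makes itertuples yield plain tuples, matching bar().
    return set(unique_rows.itertuples(index=False, name=None))

# Made-up sample data for illustration only.
df = pd.DataFrame({"a": ["ab", "a", "ab"], "b": ["c", "bc", "c"]})
keys = unique_keys(df, ["a", "b"])
concatenated = df["a"] + df["b"]
```

Unlike vect_bar, this version does not add a column to the input frame, so the same frame can be processed repeatedly without side effects.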
The timing results are as follows:
%timeit -n 100 foo(df, cols)
100 loops, best of 3: 42.9 ms per loop
%timeit -n 100 vect_foo(df, cols)
100 loops, best of 3: 1.34 ms per loop
Note: cols is a list of column names. All the elements of the data frame are strings, so the columns have dtype object.
However, it still takes a long time to apply vect_foo to all the data points. How can I speed it up further?
I tried creating a pandas series from the data and using series.apply(). However, that did not give any speedup over the regular for loop approach.
EDIT: In case I was not clear earlier: given a single data frame, the function vect_foo is quite efficient. What I want is a way to speed up applying vect_foo to all the data points, i.e. the list of data frames.
data_series = pd.Series(data)

def apply_data():
    # extra positional arguments for vect_foo are passed through args
    return data_series.apply(vect_foo, args=(cols,))
data is a list of pandas data frames, i.e. data = [df1, df2, ..., df50K]
Here, I tried pandas.Series.apply(), but it performed similarly to a normal for loop approach.
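One direction that may be worth measuring is to stack all the frames into a single frame and let pandas do the per-frame work in one grouped pass, instead of calling a function 50K times. The sketch below is only an illustration under assumptions: the real calculation on the keys is not shown in the question, so counting distinct keys per frame stands in for it, and the names unique_key_counts, "frame" and "row" are hypothetical. It also assumes that no existing column is called "frame".

```python
import pandas as pd

def unique_key_counts(data, cols):
    """Count the distinct key tuples of every frame in one pass."""
    # Stack all frames; the outer index level "frame" records the
    # list position each row came from.
    stacked = pd.concat(
        data, keys=list(range(len(data))), names=["frame", "row"]
    )
    # Turn the frame id into a column, then drop keys repeated within a frame.
    flat = stacked.reset_index(level="frame")[["frame"] + cols]
    deduped = flat.drop_duplicates()
    # One group per frame; size() counts its remaining distinct keys.
    counts = deduped.groupby("frame").size()
    # Frames without rows produce no group, so fill them in as 0.
    return counts.reindex(range(len(data)), fill_value=0)

# Made-up sample data for illustration only.
data = [
    pd.DataFrame({"a": ["x", "x", "y"], "b": ["1", "1", "2"]}),
    pd.DataFrame({"a": ["x", "y", "z"], "b": ["1", "2", "3"]}),
]
counts = unique_key_counts(data, ["a", "b"])
```

Whether this is faster than the loop depends on the real calculation and on how many rows each frame has, so it should be timed on the actual data. It only applies if the calculation can be expressed as a grouped operation.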
import pandas as pd

def foo(row, cols):
    row['keys'] = row[cols].sum()
    return row

# cols must be forwarded to foo explicitly
df.apply(foo, axis=1, args=(cols,))
Just create your helper function and use the apply function. This is usually the most efficient way to apply a function across rows/columns in pandas.
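To make the row-wise approach concrete, here is a small sketch on a made-up frame, shown next to a column-wise version that adds the string columns directly. It assumes every selected column holds strings, as stated in the question. The sample data and the variable names row_wise and column_wise are illustrative only.

```python
from functools import reduce
from operator import add

import pandas as pd

# Made-up sample data for illustration only.
df = pd.DataFrame({"a": ["x", "y"], "b": ["1", "2"], "c": ["p", "q"]})
cols = ["a", "b"]

# Row-wise: apply calls the Python function once per row.
row_wise = df[cols].apply(lambda row: "".join(row), axis=1)

# Column-wise: one elementwise string addition per selected column.
column_wise = reduce(add, (df[col] for col in cols))
```

Both produce the same keys here. Which one is faster on real data is worth checking with %timeit, since the row-wise version runs Python code for every row.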