
Pandas dataframe - speed in Python: dataframe operations, numba, cython

I have a financial dataset with ~2 million rows. I would like to import it as a pandas dataframe and add additional columns by applying row-wise functions that use some of the existing column values. For this purpose I would prefer not to use techniques like parallelization, hadoop for python, etc., so I'm faced with the following:

I am already doing something similar to the example below and performance is poor: ~24 minutes just to get through ~20K rows. Note: this is not the actual function; it is completely made up. For the additional columns I am calculating various financial option metrics. I suspect the slow speed is primarily due to iterating over all the rows, not the functions themselves, as they are fairly simple (e.g. calculating the price of an option). I know I can speed up small things within the functions, such as using erf instead of the normal distribution, but for this purpose I want to focus on the holistic problem itself.

def func(alpha, beta, time, vol):
    px = (alpha*beta)/time * vol
    return px

# Method 1 (could also use itertuples here; see the sketch below) - this is the one that takes ~24 minutes now
for i, row in df.iterrows():
    df.loc[i, 'px'] = func(alpha, beta, row['time'], row['vol'])
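
For reference, a minimal sketch of the itertuples variant mentioned in the comment above (this is an illustrative assumption about the column names; it is still a Python-level row-wise loop, so it stays slow on ~2 million rows):

# Row-wise loop using itertuples; usually somewhat faster than iterrows,
# but still one pandas assignment per row.
for t in df.itertuples():
    df.loc[t.Index, 'px'] = func(alpha, beta, t.time, t.vol)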

I have also tried vectorizing this but keep getting an error along the lines of 'cannot serialize float'.

My thought is to try one of the following methods, but I am not sure which one would theoretically be fastest. Are there non-linearities associated with running these, such that a test with 1,000 rows would not necessarily indicate which would be fastest across all 2 million rows? Probably a separate question, but should I focus on more efficient ways to manage the dataset rather than just on applying the functions?

# Alternative 1 (df.apply with existing function above)
df['px'] = df.apply(lambda row: func(alpha, beta, row['time'], row['vol']), axis=1)

# Alternative 2 (numba & jit)
from numba import jit

@jit
def func(alpha, beta, time, vol):
    px = (alpha*beta)/time * vol
    return px
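
One way to avoid the per-row overhead entirely is to hand whole columns to a numba-compiled loop rather than calling the jitted function once per row. This is a sketch of that pattern, under the assumption (not stated in the post) that alpha and beta are scalars and that 'time' and 'vol' are plain numeric columns:

import numpy as np
from numba import jit

@jit(nopython=True)
def func_all(alpha, beta, time, vol):
    # Plain loop over NumPy arrays; numba compiles this to machine code,
    # so the Python-level per-row overhead disappears.
    out = np.empty(time.shape[0])
    for i in range(time.shape[0]):
        out[i] = (alpha * beta) / time[i] * vol[i]
    return out

df['px'] = func_all(alpha, beta, df['time'].values, df['vol'].values)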

# Alternative 3 (cython) - note this uses cdef, so it must live in a .pyx
# module (or a %%cython cell) and be compiled; it is not valid plain Python.
def func_cython(double alpha, double beta, double time, double vol):
    cdef double px
    px = (alpha*beta)/time * vol
    return px
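
For completeness, a minimal build sketch for the Cython route (the module name func_cython.pyx is an assumption used here purely for illustration):

# setup.py -- minimal build sketch, assuming the function above is saved as
# func_cython.pyx; compile with:  python setup.py build_ext --inplace
from setuptools import setup
from Cython.Build import cythonize

setup(ext_modules=cythonize("func_cython.pyx"))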

In the case of Cython and numba, would I still iterate over all the rows using df.apply? Or is there a more efficient way?

I have referenced the following and found them helpful for understanding the various options, but not for deciding what the 'best' way is (though I suppose that ultimately depends on the application):

https://lectures.quantecon.org/py/need_for_speed.html

Numpy vs Cython speed

Speeding up a numpy loop in python?

Cython optimization

http://www.devx.com/opensource/improve-python-performance-with-cython.html

How about simply:

df.loc[:, 'px'] = (alpha * beta) / df.loc[:, 'time'] * df.loc[:, 'vol']

By the way, your for-loop/lambda solutions are slow because the overhead of each individual pandas access is large. Accessing each cell separately (by looping over the rows) is therefore much slower than operating on the whole column at once.
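
A rough way to see the difference is the self-contained sketch below. The data, the alpha/beta values, and the 20,000-row size are placeholders, and the absolute timings will vary by machine; the point is only the relative gap between the row-wise and vectorized forms:

import timeit
import numpy as np
import pandas as pd

alpha, beta = 1.5, 2.0
df = pd.DataFrame({'time': np.random.rand(20_000) + 0.1,
                   'vol': np.random.rand(20_000)})

# Row-wise apply: one Python function call and several pandas lookups per row.
row_wise = timeit.timeit(
    lambda: df.apply(lambda r: (alpha * beta) / r['time'] * r['vol'], axis=1),
    number=3)

# Vectorized: one operation over each whole column.
vectorized = timeit.timeit(
    lambda: (alpha * beta) / df['time'] * df['vol'],
    number=3)

print(f"row-wise apply: {row_wise:.3f}s, vectorized: {vectorized:.3f}s")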
