Vectorize calculation of a Pandas Dataframe

I have a trivial problem that I have solved using loops, but I would like to know whether some of it can be vectorized to improve performance.

Essentially I have two dataframes (DF_A and DF_B), where each row in DF_B is the sum of the corresponding row in DF_A and the row above it in DF_B. I do have the first row of values in DF_B.

import pandas as pd

df_a = pd.DataFrame([
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    # ... more rows
])
df_b = pd.DataFrame([
    [1, 2, 3, 4],
    [0, 0, 0, 0],
    # ... further rows of all 0 values, so the dimensions match df_a
])

What I am trying to achieve is that, for example, the second row in df_b will be the values of the first row in df_b plus the values of the second row in df_a. So in this case:

df_b.loc[1] = [6, 8, 10, 12]  # second row, given the default 0-based index

I was able to accomplish this with a loop over the range of df_a, saving the previous row's values and then adding the row at the current index to them. It doesn't seem very efficient.
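The loop described above might look roughly like the following. This is only a sketch of the approach as described: the sample data and the names `previous` and `current` are assumptions, not the asker's actual code.

```python
import pandas as pd

# Hypothetical sample data: df_b has a known first row and zeros below it.
df_a = pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])
df_b = pd.DataFrame([[1, 2, 3, 4], [0, 0, 0, 0], [0, 0, 0, 0]])

# Carry the previous row of df_b forward and add the current row of df_a.
previous = df_b.iloc[0].copy()
for i in range(1, len(df_a)):
    current = previous + df_a.iloc[i]
    df_b.iloc[i] = current
    previous = current

print(df_b)
```

Each iteration does a row-wise pandas operation, which is where most of the overhead of this approach comes from.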

Here is a NumPy solution. It should be significantly faster than a pandas loop, especially since the loop is JIT-compiled via numba.

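import pandas as pd
from numba import jit

# Work on the underlying NumPy arrays rather than on the DataFrames
a = df_a.values
b = df_b.values

@jit(nopython=True)
def fill_b(a, b):
    # Each row of b is the previous row of b plus the current row of a
    for i in range(1, len(b)):
        b[i] = b[i-1] + a[i]
    return b

df_b = pd.DataFrame(fill_b(a, b))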

#     0   1   2   3
# 0   1   2   3   4
# 1   6   8  10  12
# 2  15  18  21  24
# 3  28  32  36  40
# 4  45  50  55  60

Performance benchmarking

import pandas as pd, numpy as np
from numba import jit

df_a = pd.DataFrame(np.arange(1,1000001).reshape(1000,1000))

@jit(nopython=True)
def fill_b(a, b):
    for i in range(1, len(b)):
        b[i] = b[i-1] + a[i]
    return b

def jp(df_a):

    a = df_a.values
    b = np.empty(df_a.values.shape)
    b[0] = np.arange(1, 1001)

    return pd.DataFrame(fill_b(a, b))

%timeit df_a.cumsum()  # 16.1 ms
%timeit jp(df_a)       # 6.05 ms
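`%timeit` is an IPython magic, so the two lines above only run in IPython or Jupyter. In a plain Python script, a comparable measurement could be sketched with the standard `timeit` module. The sketch below leaves numba out and instead compares `DataFrame.cumsum` with `np.cumsum` on the underlying array; the function names `pandas_cumsum` and `numpy_cumsum` are illustrative and not part of the original answer, and timings will differ by machine.

```python
import timeit

import numpy as np
import pandas as pd

df_a = pd.DataFrame(np.arange(1, 1000001).reshape(1000, 1000))

def pandas_cumsum(df):
    return df.cumsum()

def numpy_cumsum(df):
    # Running sum on the underlying array, then rewrap as a DataFrame.
    return pd.DataFrame(np.cumsum(df.to_numpy(), axis=0),
                        index=df.index, columns=df.columns)

# Both approaches should agree before their timings are compared.
assert pandas_cumsum(df_a).equals(numpy_cumsum(df_a))

for func in (pandas_cumsum, numpy_cumsum):
    seconds = timeit.timeit(lambda: func(df_a), number=20) / 20
    print(f"{func.__name__}: {seconds * 1e3:.2f} ms per call")
```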

You can create df_b directly as the cumulative sum over df_a, like so. This works here because the first row of df_b equals the first row of df_a.

import pandas as pd, numpy as np

df_a = pd.DataFrame(np.arange(1,17).reshape(4,4))
df_b = df_a.cumsum()

    0   1   2   3
0   1   2   3   4
1   6   8  10  12
2  15  18  21  24
3  28  32  36  40
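If the known first row of df_b differed from the first row of df_a, a plain `cumsum` over df_a would no longer match the recurrence. One possible way to stay vectorized, sketched here with a hypothetical `first_row`, is to swap that row in before taking the running sum, since row i of df_b is `first_row` plus rows 1 through i of df_a.

```python
import numpy as np
import pandas as pd

df_a = pd.DataFrame(np.arange(1, 17).reshape(4, 4))
first_row = np.array([10, 20, 30, 40])  # hypothetical starting row for df_b

a = df_a.to_numpy()
# b[i] = first_row + a[1] + ... + a[i]: replace a[0] with first_row, then cumsum.
stacked = np.vstack([first_row, a[1:]])
df_b = pd.DataFrame(stacked.cumsum(axis=0),
                    index=df_a.index, columns=df_a.columns)

print(df_b)
```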
