
Vectorizing an iterative function on Pandas DataFrame

I have a dataframe where the first row is the initial condition.

import numpy as np
import pandas as pd

df = pd.DataFrame({"Year": np.arange(4),
                   "Pop": [0.4] + [np.nan] * 3})

and a function f(x, r) = r*x*(1-x), where r = 2 is a constant and 0 <= x <= 1.
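For reference, the function and the first few iterates can be sketched in plain Python. This is only an illustration of the recurrence described here; the names `f` and `values` are chosen for the example and are not from the original post.

```python
def f(x, r=2.0):
    """One step of the recurrence: r * x * (1 - x)."""
    return r * x * (1 - x)

# Start from the initial condition 0.4 and apply f three times.
x = 0.4
values = [x]
for _ in range(3):
    x = f(x)
    values.append(x)

print(values)  # four values, starting at 0.4 and approaching 0.5
```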

I want to produce the following dataframe by applying the function to column Pop row-by-row iteratively, i.e., df.Pop[i] = f(df.Pop[i-1], r=2).

df = pd.DataFrame({"Year": np.arange(4),
                   "Pop": [0.4, 0.48, 0.4992, 0.49999872]})

Question: Is it possible to do this in a vectorized way?

I can achieve the desired result by using a loop to build lists for the x and y values, but this is not vectorized.
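The loop itself is not shown in the post. A minimal sketch of what such a loop might look like, assuming the dataframe from the question and a constant `R = 2` (the list name `pop` is hypothetical):

```python
import numpy as np
import pandas as pd

R = 2
df = pd.DataFrame({"Year": np.arange(4),
                   "Pop": [0.4] + [np.nan] * 3})

# Build the column as a plain list, one value at a time.
pop = [df["Pop"].iloc[0]]
for _ in range(len(df) - 1):
    prev = pop[-1]
    pop.append(R * prev * (1 - prev))

df["Pop"] = pop
print(df)
```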

I have also tried this, but all nan places are filled with 0.48.

df.loc[1:, "Pop"] = R * df.Pop[:-1] * (1 - df.Pop[:-1])

It is IMPOSSIBLE to do this in a vectorized way.

By definition, vectorization makes use of parallel processing to reduce execution time. But the desired values in your question must be computed in sequential order, not in parallel, because each value depends on the one before it. See this answer for a detailed explanation. Things like df.expanding(2).apply(f) and df.rolling(2).apply(f) won't work.
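To see the dependence concretely, here is a small sketch using a plain NumPy array (an assumption made for the example, so that pandas index alignment does not get in the way; the names `x` and `R` are illustrative). A single shifted expression advances the sequence by only one step, because the right-hand side is computed from the values that exist before the assignment:

```python
import numpy as np

R = 2
x = np.array([0.4, np.nan, np.nan, np.nan])

# One "vectorized" update: every element is computed from its left
# neighbour, but only the first neighbour holds a real value so far.
x[1:] = R * x[:-1] * (1 - x[:-1])

print(x)  # only x[1] is filled in; x[2] and x[3] are still nan
```

Repeating the update would fill one more element each time, which is just the sequential loop in disguise.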

However, gaining more efficiency is possible. You can do the iteration using a generator. This is a very common construct for implementing iterative processes.

def gen(x_init, n, R=2):
    x = x_init
    for _ in range(n):
        x = R * x * (1-x)
        yield x

# execute            
df.loc[1:, "Pop"] = list(gen(df.at[0, "Pop"], len(df) - 1))

Result:

print(df[["Pop"]])
        Pop
0  0.400000
1  0.480000
2  0.499200
3  0.499999

It is completely OK to stop here for small-sized data. If the function is going to be called many times, however, you can consider optimizing the generator with numba.

  • Run pip install numba or conda install numba in the console first.
  • import numba
  • Add the decorator @numba.njit in front of the generator.
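As a rough sketch of a compiled version: instead of decorating the generator as the steps above describe, the variant below fills a preallocated NumPy array inside a jitted function, which is a different structure from the answer's generator. The function name `iterate` is hypothetical, and the `try`/`except` fallback is an addition so the example still runs (uncompiled) when numba is not installed.

```python
import numpy as np

try:
    import numba
    njit = numba.njit
except ImportError:
    # Fallback (assumption for this sketch): run as plain Python.
    def njit(func):
        return func


@njit
def iterate(x_init, n, R):
    """Return the n values that follow x_init under x -> R*x*(1-x)."""
    out = np.empty(n)
    x = x_init
    for i in range(n):
        x = R * x * (1.0 - x)
        out[i] = x
    return out


vals = iterate(0.4, 3, 2.0)
print(vals)
```

The result could then be assigned with something like df.loc[1:, "Pop"] = iterate(df.at[0, "Pop"], len(df) - 1, 2.0).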

Change the number of np.nan values to 10^6 and check out the difference in execution time yourself. An improvement from 468 ms to 217 ms was achieved on my Core-i5 8250U 64-bit laptop.
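One way to take such a measurement, sketched here with the standard-library time.perf_counter and the plain (non-numba) generator. The variable names are illustrative, and the timing will differ from machine to machine:

```python
import time


def gen(x_init, n, R=2):
    x = x_init
    for _ in range(n):
        x = R * x * (1 - x)
        yield x


n = 10**6

# Time how long it takes to exhaust the generator into a list.
start = time.perf_counter()
values = list(gen(0.4, n))
elapsed = time.perf_counter() - start

print(f"{n} steps took {elapsed * 1000:.0f} ms")
```

When timing a numba-compiled version the same way, it is worth calling it once beforehand, since the first call includes compilation time.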
