Vectorizing an iterative function on Pandas DataFrame
I have a dataframe where the first row is the initial condition.
import numpy as np
import pandas as pd
df = pd.DataFrame({"Year": np.arange(4),
                   "Pop": [0.4] + [np.nan] * 3})
and a function f(x,r) = r*x*(1-x), where r = 2 is a constant and 0 <= x <= 1.
I want to produce the following dataframe by applying the function to column Pop row-by-row iteratively, i.e., df.Pop[i] = f(df.Pop[i-1], r=2):
df = pd.DataFrame({"Year": np.arange(4),
                   "Pop": [0.4, 0.48, 0.4992, 0.49999872]})
Question: Is it possible to do this in a vectorized way?
I can achieve the desired result by using a loop to build lists for the x and y values, but this is not vectorized.
I have also tried this, but all nan places are filled with 0.48.
df.loc[1:, "Pop"] = R * df.Pop[:-1] * (1 - df.Pop[:-1])
It is IMPOSSIBLE to do this in a vectorized way.
By definition, vectorization makes use of parallel processing to reduce execution time. But the desired values in your question must be computed in sequential order, not in parallel: each value depends on the one before it. See this answer for detailed explanation. Things like df.expanding(2).apply(f) and df.rolling(2).apply(f) won't work.
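To make the sequential dependency concrete, here is a small sketch using a plain NumPy array in place of the Pop column (the names one_pass and full are only for illustration). A single vectorized pass evaluates the whole right-hand side before writing anything back, so it can only fill the position whose input is already known.

```python
import numpy as np

R = 2
pop = np.array([0.4, np.nan, np.nan, np.nan])

# One vectorized pass: the right-hand side is computed from the old
# values in full, so only index 1 has a known input (pop[0]).
one_pass = pop.copy()
one_pass[1:] = R * pop[:-1] * (1 - pop[:-1])
print(one_pass)  # index 1 is filled, indices 2 and 3 are still nan

# Repeating the pass fills one more position each time. This is the
# sequential loop again, only with redundant work in every pass.
full = pop.copy()
for _ in range(len(full) - 1):
    full[1:] = R * full[:-1] * (1 - full[:-1])
print(full)
```

Filling n values this way takes n passes over the array, so it is slower than a simple loop, not faster.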
However, gaining more efficiency is possible. You can do the iteration using a generator. This is a very common construct for implementing iterative processes.
def gen(x_init, n, R=2):
    x = x_init
    for _ in range(n):
        x = R * x * (1 - x)
        yield x

# execute
df.loc[1:, "Pop"] = list(gen(df.at[0, "Pop"], len(df) - 1))
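As an aside, the same recurrence can also be written with itertools.accumulate from the standard library. This is an alternative sketch, not the approach above; the helper name logistic_step and the variable n_rows are made up for illustration, and the initial keyword needs Python 3.8 or later.

```python
from itertools import accumulate, repeat

def logistic_step(x, _, r=2):
    """One step of x -> r * x * (1 - x); the second argument is ignored."""
    return r * x * (1 - x)

n_rows = 4  # assumed number of rows, matching the example dataframe

# accumulate feeds each result back in as the next x; the repeated None
# values only control how many steps are taken.
pop = list(accumulate(repeat(None, n_rows - 1), logistic_step, initial=0.4))
print(pop)
```

The resulting list includes the initial value, so it could be assigned to the whole column with df["Pop"] = pop. It is still a sequential computation, just a more compact one.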
Result:
print(df)
Pop
0 0.400000
1 0.480000
2 0.499200
3 0.499999
It is completely OK to stop here for small-sized data. If the function is going to be performed a lot of times, however, you can consider optimizing the generator with numba.
First run pip install numba or conda install numba in the console.
Then import numba and add the decorator @numba.njit in front of the generator.
Change the number of np.nan values to 10^6 and check out the difference in execution time yourself. An improvement from 468ms to 217ms was achieved on my Core-i5 8250U 64bit laptop.
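For reference, a related way to use numba is to compile a function that fills a preallocated array, instead of decorating the generator. The sketch below is a variant under that assumption, not the exact code timed above; the function name iterate_logistic is hypothetical, and the try/except fallback is only there so the snippet still runs (uncompiled) when numba is not installed.

```python
import numpy as np

try:
    from numba import njit  # optional dependency
except ImportError:
    def njit(func):
        # Fallback: use the plain Python function when numba is missing.
        return func

@njit
def iterate_logistic(x_init, n, r):
    # Preallocate the output and fill it sequentially; each value
    # depends on the previous one, so the loop itself cannot be removed.
    out = np.empty(n + 1, dtype=np.float64)
    out[0] = x_init
    for i in range(n):
        out[i + 1] = r * out[i] * (1.0 - out[i])
    return out

values = iterate_logistic(0.4, 3, 2.0)
print(values)
```

The returned array includes the initial value, so it could be assigned with df["Pop"] = values for a dataframe of matching length. The first call includes compilation time, so time a second call when comparing against the pure Python version.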