How to vectorize pandas code where it depends on previous row?
I am trying to vectorize a code snippet in pandas.
I have a pandas dataframe generated like this:
|   | ids | ftest | vals |
|---|---|---|---|
| 0 | Q52EG | 0 | 0 |
| 1 | Q52EG | 0 | 1 |
| 2 | Q52EG | 1 | 2 |
| 3 | Q52EG | 1 | 3 |
| 4 | Q52EG | 1 | 4 |
| 5 | QQ8Q4 | 0 | 5 |
| 6 | QQ8Q4 | 0 | 6 |
| 7 | QQ8Q4 | 1 | 7 |
| 8 | QQ8Q4 | 1 | 8 |
| 9 | QVIPW | 1 | 9 |
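
The code that generated this frame is not shown, so here is a minimal sketch that rebuilds the same sample data by hand. The name `id_df` is taken from the loop later in the question; building the frame from literal lists is an assumption made only so the later snippets can be run.

```python
import pandas as pd

# Sample data copied from the table in the question.
# The original generation code is not shown, so the values are typed in directly.
id_df = pd.DataFrame(
    {
        "ids": ["Q52EG"] * 5 + ["QQ8Q4"] * 4 + ["QVIPW"],
        "ftest": [0, 0, 1, 1, 1, 0, 0, 1, 1, 1],
        "vals": list(range(10)),
    }
)

print(id_df)
```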
If any id in the `ids` column has a value of 1 in the `ftest` column, then all subsequent rows with the same id should be marked as 1 in the `has_hist` column. The flag depends only on earlier rows, not on the current row's `ftest` value, as shown in the dataframe below:
|   | ids | ftest | vals | has_hist |
|---|---|---|---|---|
| 0 | Q52EG | 0 | 0 | 0 |
| 1 | Q52EG | 0 | 1 | 0 |
| 2 | Q52EG | 1 | 2 | 0 |
| 3 | Q52EG | 1 | 3 | 1 |
| 4 | Q52EG | 1 | 4 | 1 |
| 5 | QQ8Q4 | 0 | 5 | 0 |
| 6 | QQ8Q4 | 0 | 6 | 0 |
| 7 | QQ8Q4 | 1 | 7 | 0 |
| 8 | QQ8Q4 | 1 | 8 | 1 |
| 9 | QVIPW | 1 | 9 | 0 |
I am doing this using an iterative approach like this:

    previous_present = {}
    has_prv_history = []
    for index, value in id_df.iterrows():
        my_id = value["ids"]
        ftest_mentioned = value["ftest"]
        previous_flag = 0
        if my_id in previous_present.keys():
            previous_flag = 1
        elif (ftest_mentioned == 1):
            previous_present[my_id] = 1
        has_prv_history.append(previous_flag)
    id_df["has_hist"] = has_prv_history

Can this code be vectorized without using `apply`?
Two key functions for this kind of task are `shift` and `ffill`, applied per group. For this specific question:

    # df is the input frame (id_df in the question); work on a copy
    df2 = df.copy()
    df2["has_hist"] = df2.groupby("ids").ftest.shift().where(lambda s: s.eq(1))
    df2["has_hist"] = df2.groupby("ids").has_hist.ffill().fillna(0).astype("int32")

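
To see why the `shift` / `where` / `ffill` chain works, it can help to look at each intermediate column. The sketch below splits the chain into separate steps on the sample data from the question. The intermediate names `prev`, `marked`, and `steps` are illustrative and not part of the original answer.

```python
import pandas as pd

# Sample data from the question (only the columns the logic needs).
df = pd.DataFrame(
    {
        "ids": ["Q52EG"] * 5 + ["QQ8Q4"] * 4 + ["QVIPW"],
        "ftest": [0, 0, 1, 1, 1, 0, 0, 1, 1, 1],
    }
)

# Step 1: the previous row's ftest within the same id
# (NaN on the first row of each id, since there is no previous row).
prev = df.groupby("ids")["ftest"].shift()

# Step 2: keep only the 1s; zeros and NaN both become NaN.
marked = prev.where(prev.eq(1))

# Step 3: carry each 1 forward within its id, then turn the remaining NaN into 0.
has_hist = marked.groupby(df["ids"]).ffill().fillna(0).astype("int32")

# Side-by-side view of the intermediate columns.
steps = pd.DataFrame(
    {
        "ids": df["ids"],
        "ftest": df["ftest"],
        "prev": prev,
        "marked": marked,
        "has_hist": has_hist,
    }
)
print(steps)
```

The final `has_hist` column here matches the expected column in the question: a row is flagged only once some earlier row of the same id had `ftest == 1`.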
Here is a variant with `transform`, which in my experience is often slower than "pure" pandas operations:

    # Same logic, computed inside a single transform per group
    df2["has_hist"] = (
        df
        .groupby("ids")
        .ftest.transform(
            lambda s: (
                s
                .shift()
                .where(lambda t: t.eq(1))
                .ffill()
                .fillna(0)
                .astype("int32")
            )
        )
    )

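
Before replacing the loop, it may be worth checking that the vectorized version gives the same flags, including when rows of different ids are interleaved. The following is a rough sketch of such a check. The helper names `has_hist_loop` and `has_hist_vectorized` and the ids `A`, `B`, `C` are hypothetical, made up for this example.

```python
import pandas as pd


def has_hist_loop(frame):
    """Row-by-row reference that mirrors the loop in the question."""
    seen = set()  # ids that already had ftest == 1 on an earlier row
    flags = []
    for my_id, ftest in zip(frame["ids"], frame["ftest"]):
        if my_id in seen:
            flags.append(1)
        else:
            flags.append(0)
            if ftest == 1:
                seen.add(my_id)
    return pd.Series(flags, index=frame.index, dtype="int32")


def has_hist_vectorized(frame):
    """Per-group shift, mask, and forward fill, as in the answer."""
    prev = frame.groupby("ids")["ftest"].shift()
    marked = prev.where(prev.eq(1))
    return marked.groupby(frame["ids"]).ffill().fillna(0).astype("int32")


# Hypothetical test data with interleaved ids, so groups are not contiguous.
df = pd.DataFrame(
    {
        "ids": ["A", "B", "A", "B", "A", "C"],
        "ftest": [1, 0, 0, 1, 0, 1],
    }
)

loop_result = has_hist_loop(df)
vec_result = has_hist_vectorized(df)

# Compare plain values so that Series names do not matter.
print(loop_result.tolist() == vec_result.tolist())
```

A check like this only covers the cases you feed it, so it is a sanity test rather than a proof of equivalence.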