Pandas dataframe column value depends on a dynamic number of rows

Question

I have a dataframe that looks like this:

user	current_date	prior_date	points_scored
1	2021-01-01	2020-10-01	5
2	2021-01-01	2020-10-01	4
2	2021-01-21	2020-10-21	4
2	2021-05-01	2021-02-01	4

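The prior_date column is simply current_date - 3 months, and points_scored is the number of points scored on current_date. For a given user, I want to identify the rows where sum(points_scored) >= 8, where the rows being summed are those whose current_date falls between that row's prior_date and current_date. It is guaranteed that no single row has points_scored >= 8.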
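For anyone who wants to reproduce the sample data, here is a minimal sketch. The column names are taken from the code later in the question; deriving prior_date with `pd.DateOffset(months=3)` is an assumption about how the three-month offset was computed.

```python
import pandas as pd

# Sample rows from the table above.
df = pd.DataFrame({
    "user": [1, 2, 2, 2],
    "current_date": pd.to_datetime(
        ["2021-01-01", "2021-01-01", "2021-01-21", "2021-05-01"]
    ),
    "points_scored": [5, 4, 4, 4],
})

# Assumption: prior_date is exactly three calendar months before current_date.
df["prior_date"] = df["current_date"] - pd.DateOffset(months=3)
```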

For the example above, I would like to get back something like this:

user	current_date	prior_date	points_scored	flag
1	2021-01-01	2020-10-01	5	0
2	2021-01-01	2020-10-01	4	0
2	2021-01-21	2020-10-21	4	1
2	2021-05-01	2021-02-01	4	0

The third row shows flag=1 because, for row 3 with current_date=2021-01-21 and prior_date=2020-10-21, the rows to consider are rows 2 and 3. Row 2 is included because its current_date=2021-01-01 lies between row 3's prior_date and current_date.

Ultimately, I want a data structure that lists each distinct user with its flag. It can be a dataframe or a dictionary, anything that is easy to look up.

user	flag
1	0
2	1

To do this, I am currently doing something like the following:

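flags = {}
# Only users with more than two rows are checked (users with exactly two rows are skipped).
counts = df['user'].value_counts()
ids = list(counts[counts > 2].index)
for user_id in ids:
    temp_df = df[df['user'] == user_id]
    for idx, row in temp_df.iterrows():
        cur_date = row['current_date']
        prior_date = row['prior_date']
        # Sum this user's points for rows whose current_date lies in [prior_date, cur_date].
        in_window = (temp_df['current_date'] <= cur_date) & (temp_df['current_date'] >= prior_date)
        temp_total = temp_df.loc[in_window, 'points_scored'].sum()
        if temp_total >= 8:
            flags[user_id] = 1
            break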

The code above works, but it takes far too long to run on the real data.

Answer 1

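You are right that looping over a large dataset can be very time-consuming. This is where numpy really shines. I am still not sure exactly what you want, but I can help with the speed: numpy.select can evaluate your if/else logic efficiently.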

import pandas as pd
import numpy as np

# Conditions are checked in order; the first one that is True selects the choice at the same position.
condition = [df['points_scored'] == 5, df['points_scored'] == 4, df['points_scored'] == 3]  # <-- put your conditions here
choices = ['okay', 'hmmm!', 'yes']  # <-- what you want returned (the order is important)
np.select(condition, choices, default='default value')
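To make the snippet above concrete, here is a small self-contained run on the points_scored values from the question. The frame is rebuilt inline so the example runs on its own; the `label` column name is only illustrative.

```python
import numpy as np
import pandas as pd

# points_scored values from the question's sample table.
df = pd.DataFrame({"user": [1, 2, 2, 2], "points_scored": [5, 4, 4, 4]})

condition = [
    df["points_scored"] == 5,
    df["points_scored"] == 4,
    df["points_scored"] == 3,
]
choices = ["okay", "hmmm!", "yes"]

# Rows matching no condition would receive the default value.
df["label"] = np.select(condition, choices, default="default value")
print(df["label"].tolist())  # ['okay', 'hmmm!', 'hmmm!', 'hmmm!']
```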

Also, you may want to state what you are after more concisely. In the meantime, you can refactor your loop with np.select().
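For the windowed sum itself, one possible loop-free approach is sketched below. It is not part of the answer above: it swaps in a per-user self-merge rather than np.select, and `flag_users` is a hypothetical helper name. It assumes the column names from the question and treats both window endpoints as inclusive, as the question's loop does.

```python
import pandas as pd

def flag_users(df: pd.DataFrame, threshold: int = 8) -> pd.Series:
    """Hypothetical helper: 0/1 flag per user, 1 if any row's window sum >= threshold."""
    left = df.assign(row_id=range(len(df)))
    right = df[["user", "current_date", "points_scored"]]
    # Pair every row with every row of the same user (including itself).
    pairs = left.merge(right, on="user", suffixes=("", "_other"))
    # Keep pairs where the other row's current_date is in [prior_date, current_date].
    in_window = pairs["current_date_other"].between(
        pairs["prior_date"], pairs["current_date"]
    )
    window_sum = pairs[in_window].groupby("row_id")["points_scored_other"].sum()
    hit = window_sum >= threshold
    # A user is flagged if at least one of their rows reaches the threshold.
    user_of_row = left.set_index("row_id")["user"]
    return hit.groupby(user_of_row).any().astype(int).rename("flag")

# Sample data from the question.
df = pd.DataFrame({
    "user": [1, 2, 2, 2],
    "current_date": pd.to_datetime(
        ["2021-01-01", "2021-01-01", "2021-01-21", "2021-05-01"]
    ),
    "prior_date": pd.to_datetime(
        ["2020-10-01", "2020-10-01", "2020-10-21", "2021-02-01"]
    ),
    "points_scored": [5, 4, 4, 4],
})

flags = flag_users(df)
print(flags.to_dict())  # {1: 0, 2: 1}
```

Unlike the loop in the question, this sketch returns an explicit 0 for unflagged users and does not skip users with only a few rows. The self-merge grows quadratically with the number of rows per user, so it trades memory for speed and may not suit users with very long histories.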

Solution 1
0 2021-04-13 03:00:07