Pandas dataframe column value dependent on dynamic number of rows
I've got a dataframe that looks something like this:
user | current_date | prior_date | points_scored |
---|---|---|---|
1 | 2021-01-01 | 2020-10-01 | 5 |
2 | 2021-01-01 | 2020-10-01 | 4 |
2 | 2021-01-21 | 2020-10-21 | 4 |
2 | 2021-05-01 | 2021-02-01 | 4 |
The `prior_date` column is simply `current_date - 3 months`, and `points_scored` is the number of points scored on `current_date`. I'd like to identify which rows have `sum(points_scored) >= 8`, where for a given `user` the rows considered are those whose `current_date` falls between that row's `prior_date` and `current_date` (inclusive). It is guaranteed that no single row has `points_scored >= 8`.
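For reference, the sample data above can be rebuilt as follows. This is a minimal sketch: it assumes `prior_date` is derived with `pd.DateOffset(months=3)`, which is one way to express "minus 3 months" and is not necessarily how the original column was produced.

```python
import pandas as pd

# Sample rows copied from the table in the question.
df = pd.DataFrame({
    "user": [1, 2, 2, 2],
    "current_date": pd.to_datetime(
        ["2021-01-01", "2021-01-01", "2021-01-21", "2021-05-01"]
    ),
    "points_scored": [5, 4, 4, 4],
})

# Assumption: "current_date - 3 months" means a calendar-month offset.
df["prior_date"] = df["current_date"] - pd.DateOffset(months=3)

print(df)
```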
For example, for the data above, I'd like something like this returned:
user | current_date | prior_date | points_scored | flag |
---|---|---|---|---|
1 | 2021-01-01 | 2020-10-01 | 5 | 0 |
2 | 2021-01-01 | 2020-10-01 | 4 | 0 |
2 | 2021-01-21 | 2020-10-21 | 4 | 1 |
2 | 2021-05-01 | 2021-02-01 | 4 | 0 |
The third row shows `flag=1` because, for row 3's `current_date=2021-01-21` and `prior_date=2020-10-21`, the rows to consider are rows 2 and 3. We consider row 2 because its `current_date=2021-01-01` falls between row 3's `prior_date` and `current_date`. Together the two rows score 4 + 4 = 8.
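The window rule described above can be sketched without an explicit Python loop by joining each user's rows to one another and filtering on the date window. This is only one possible approach, not something taken from the question: the helper column `row` and the `_other` suffix are illustrative names, and the self-merge grows quadratically with the number of rows per user, so it may not suit users with very many rows.

```python
import pandas as pd

df = pd.DataFrame({
    "user": [1, 2, 2, 2],
    "current_date": pd.to_datetime(
        ["2021-01-01", "2021-01-01", "2021-01-21", "2021-05-01"]
    ),
    "points_scored": [5, 4, 4, 4],
})
df["prior_date"] = df["current_date"] - pd.DateOffset(months=3)
df["row"] = range(len(df))  # hypothetical helper: stable id for each row

# Pair every row with every row of the same user (including itself).
pairs = df.merge(
    df[["user", "current_date", "points_scored"]],
    on="user",
    suffixes=("", "_other"),
)

# Keep pairs where the other row's date lies in [prior_date, current_date].
in_window = (
    (pairs["current_date_other"] <= pairs["current_date"])
    & (pairs["current_date_other"] >= pairs["prior_date"])
)

# Sum the points inside each row's window, then flag sums of 8 or more.
window_sum = pairs.loc[in_window].groupby("row")["points_scored_other"].sum()
df["flag"] = (df["row"].map(window_sum) >= 8).astype(int)

print(df[["user", "current_date", "prior_date", "points_scored", "flag"]])
```

On the sample data this flags only the third row, matching the expected table.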
Ultimately, I'd like to end up with a data structure that shows each distinct user and its flag. It could be a dataframe or a dictionary, anything easily referenceable.
user | flag |
---|---|
1 | 0 |
2 | 1 |
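Once a per-row `flag` column exists, collapsing it to one value per user could look like the sketch below. It assumes a user should be flagged if any of its rows is flagged; the `flagged` frame is hard-coded here from the expected output purely for illustration.

```python
import pandas as pd

# Per-row flags, copied from the expected output table.
flagged = pd.DataFrame({
    "user": [1, 2, 2, 2],
    "flag": [0, 0, 1, 0],
})

# A user is flagged when at least one of its rows is flagged.
user_flags = flagged.groupby("user")["flag"].max()

# Either keep it as a dataframe ...
user_flags_df = user_flags.reset_index()
# ... or turn it into a dictionary keyed by user.
flags = user_flags.to_dict()

print(user_flags_df)
print(flags)
```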
To do this, I'm doing something like this:
flags = {}
# Users that appear in more than two rows
ids = list(df['user'].value_counts()[df['user'].value_counts() > 2].index)
for id in ids:
    temp_df = df[df['user'] == id]
    for idx, row in temp_df.iterrows():
        cur_date = row['current_date']
        prior_date = row['prior_date']
        # Sum this user's points over rows dated within [prior_date, cur_date]
        in_window = (temp_df['current_date'] <= cur_date) & (temp_df['current_date'] >= prior_date)
        temp_total = temp_df.loc[in_window, 'points_scored'].sum()
        if temp_total >= 8:
            flags[id] = 1
            break
The code above works, but it takes far too long to execute.
You are right: looping over large data can be very time-consuming. This is where NumPy comes into its own. I am still not sure exactly what you want, but I can help with the speed: numpy.select can evaluate your if/else logic efficiently.
import pandas as pd
import numpy as np

# put your conditions here
condition = [df['points_scored'] == 5, df['points_scored'] == 4, df['points_scored'] == 3]
# what you want returned for each condition (the order is important)
choices = ['okay', 'hmmm!', 'yes']
np.select(condition, choices, default='default value')
Also, you might want to state more succinctly what you want. Meanwhile, you can refactor your loops with np.select().
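To make the np.select() suggestion concrete, here is a small runnable sketch. The dataframe and the `label` column are made up for illustration. Note that np.select() picks a value for each row from that row's own conditions, so on its own it does not compute sums across a date window.

```python
import numpy as np
import pandas as pd

# Illustrative data only; 7 matches none of the conditions below.
df = pd.DataFrame({"points_scored": [5, 4, 3, 7]})

conditions = [
    df["points_scored"] == 5,
    df["points_scored"] == 4,
    df["points_scored"] == 3,
]
choices = ["okay", "hmmm!", "yes"]  # same order as the conditions

# The first matching condition wins; rows with no match get the default.
df["label"] = np.select(conditions, choices, default="default value")

print(df)
```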