Pandas dataframe column value depends on a dynamic number of rows

Question

I've got a dataframe that looks something like this:

user	current_date	prior_date	points_scored
1	2021-01-01	2020-10-01	5
2	2021-01-01	2020-10-01	4
2	2021-01-21	2020-10-21	4
2	2021-05-01	2021-02-01	4

The prior_date column is simply current_date - 3 months, and points_scored is the number of points scored on current_date. I'd like to identify which rows have sum(points_scored) >= 8, where, for a given user, the rows summed are those whose current_date falls between that row's prior_date and current_date (inclusive). It is guaranteed that no single row has points_scored >= 8.

For example, for the data above, I'd like something like this returned:

user	current_date	prior_date	points_scored	flag
1	2021-01-01	2020-10-01	5	0
2	2021-01-01	2020-10-01	4	0
2	2021-01-21	2020-10-21	4	1
2	2021-05-01	2021-02-01	4	0

The third row shows flag=1 because, for row 3's values of current_date=2021-01-21 and prior_date=2020-10-21, the rows to consider are rows 2 and 3. We consider row 2 because its current_date=2021-01-01 falls between row 3's prior_date and current_date.

Ultimately, I'd like to end up with a data structure that shows each distinct user and its flag. It could be a dataframe or a dictionary, anything easily referenceable.

user	flag
1	0
2	1

To do this, I'm doing something like this:

flags = {}
# Only users with more than two rows are checked
counts = df['user'].value_counts()
ids = list(counts[counts > 2].index)
for id in ids:
    temp_df = df[df['user'] == id]
    for idx, row in temp_df.iterrows():
        cur_date = row['current_date']
        prior_date = row['prior_date']
        # Sum this user's points for rows dated within [prior_date, cur_date]
        in_window = (temp_df['current_date'] <= cur_date) & (temp_df['current_date'] >= prior_date)
        temp_total = temp_df[in_window]['points_scored'].sum()
        if temp_total >= 8:
            flags[id] = 1
            break

The code above works, but it takes far too long to execute.

Answer 1

You are right, looping over large data can be quite time-consuming. This is where numpy comes into full play. I am still not sure exactly what you want, but I can help with the speed: numpy.select can evaluate your if/else logic efficiently.

import pandas as pd
import numpy as np

condition = [df['points_scored'] == 5, df['points_scored'] == 4, df['points_scored'] == 3]  # <-- put your conditions here
choices = ['okay', 'hmmm!', 'yes']  # <-- what you want returned (the order is important)
np.select(condition, choices, default='default value')
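To make the call above concrete, here is a minimal self-contained sketch on a made-up frame. The sample values and the `label` column name are assumptions for illustration only. Each row gets the choice for the first condition it satisfies, and rows matching none get the default.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, only for illustration
df = pd.DataFrame({"points_scored": [5, 4, 4, 3, 7]})

# Conditions are checked in order; the first match wins
conditions = [
    df["points_scored"] == 5,
    df["points_scored"] == 4,
    df["points_scored"] == 3,
]
choices = ["okay", "hmmm!", "yes"]

# Rows that satisfy none of the conditions receive the default
df["label"] = np.select(conditions, choices, default="default value")
print(df)
```

Note that this labels each row from its own values only; it does not by itself sum over a window of other rows.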

Also, you might want to state what you want more succinctly. Meanwhile, you can refactor your loops with np.select().
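If the goal is the per-row window sum described in the question, one possible loop-free approach (not something from the question's code, just a sketch) is to self-merge the frame on user, keep the pairs whose date falls inside each row's window, and sum per row. The helper name `flag_rows`, the `threshold` parameter, and the `row_id` and `_other` column names are all hypothetical. Unlike the loop in the question, this checks every user, including those with only one or two rows.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "user": [1, 2, 2, 2],
    "current_date": pd.to_datetime(
        ["2021-01-01", "2021-01-01", "2021-01-21", "2021-05-01"]
    ),
    "points_scored": [5, 4, 4, 4],
})
df["prior_date"] = df["current_date"] - pd.DateOffset(months=3)


def flag_rows(df, threshold=8):
    """Hypothetical helper: flag rows whose 3-month window sum reaches threshold."""
    base = df.reset_index(drop=True)
    base["row_id"] = base.index
    others = base[["user", "current_date", "points_scored"]]
    # Pair every row with every row of the same user
    pairs = base.merge(others, on="user", suffixes=("", "_other"))
    # Keep pairs where the other row's date lies in [prior_date, current_date]
    in_window = pairs["current_date_other"].between(
        pairs["prior_date"], pairs["current_date"]
    )
    window_sum = (
        pairs.loc[in_window].groupby("row_id")["points_scored_other"].sum()
    )
    totals = window_sum.reindex(base["row_id"]).fillna(0)
    base["flag"] = (totals >= threshold).astype(int).to_numpy()
    return base.drop(columns="row_id")


flagged = flag_rows(df)
# One flag per user: 1 if any of that user's rows is flagged
user_flags = flagged.groupby("user")["flag"].max().to_dict()
print(flagged)
print(user_flags)
```

The trade-off is memory: the merge produces one row per pair of rows for the same user, so it grows quadratically with the number of rows per user. For users with very many rows, a sorted, per-user cumulative approach may be a better fit.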
