
Pandas dataframe column value dependent on dynamic number of rows

I've got a dataframe that looks something like this:

user  current_date  prior_date  points_scored
1     2021-01-01    2020-10-01  5
2     2021-01-01    2020-10-01  4
2     2021-01-21    2020-10-21  4
2     2021-05-01    2021-02-01  4

The prior_date column is simply current_date - 3 months, and points_scored is the number of points scored on current_date. I'd like to identify which rows have sum(points_scored) >= 8, where, for a given user, the rows summed are those whose current_date falls between that row's prior_date and current_date (inclusive). It is guaranteed that no single row has points_scored >= 8.
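As a minimal sketch, the sample data can be rebuilt like this so the rest of the question is reproducible. The use of pd.DateOffset(months=3) for "3 months" is an assumption; the question does not say how prior_date was actually derived.

```python
import pandas as pd

# Sample data copied from the table above.
df = pd.DataFrame({
    "user": [1, 2, 2, 2],
    "current_date": pd.to_datetime(
        ["2021-01-01", "2021-01-01", "2021-01-21", "2021-05-01"]
    ),
    "points_scored": [5, 4, 4, 4],
})

# Assumption: "3 months" means three calendar months before current_date.
df["prior_date"] = df["current_date"] - pd.DateOffset(months=3)

print(df)
```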

For example, for the data above, I'd like something like this returned:

user  current_date  prior_date  points_scored  flag
1     2021-01-01    2020-10-01  5              0
2     2021-01-01    2020-10-01  4              0
2     2021-01-21    2020-10-21  4              1
2     2021-05-01    2021-02-01  4              0

The third row shows flag=1 because, for row 3's values of current_date=2021-01-21 and prior_date=2020-10-21, the rows to consider are rows 2 and 3. We consider row 2 because its current_date=2021-01-01 falls between row 3's prior_date and current_date. Together the two rows score 4 + 4 = 8 points.
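The window for row 3 can be checked directly. This is only an illustration of the rule described above, using the sample data with an inclusive window on both ends (as in the comparison operators of the loop further down); the variable names are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "user": [1, 2, 2, 2],
    "current_date": pd.to_datetime(
        ["2021-01-01", "2021-01-01", "2021-01-21", "2021-05-01"]
    ),
    "prior_date": pd.to_datetime(
        ["2020-10-01", "2020-10-01", "2020-10-21", "2021-02-01"]
    ),
    "points_scored": [5, 4, 4, 4],
})

row = df.iloc[2]  # "row 3" in the text (positions are zero-based here)

# Rows of the same user whose current_date lies in [prior_date, current_date].
same_user = df["user"] == row["user"]
in_window = df["current_date"].between(row["prior_date"], row["current_date"])
window = df[same_user & in_window]

total = window["points_scored"].sum()
flag = int(total >= 8)
print(window.index.tolist(), total, flag)
```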

Ultimately, I'd like to end up with a data structure that shows each distinct user and its flag. It could be a dataframe or a dictionary, anything easily referenceable.

user  flag
1     0
2     1

To do this, I'm doing something like this:

flags = {}
# Only look at users with more than two rows.
counts = df['user'].value_counts()
ids = list(counts[counts > 2].index)
for id in ids:
    temp_df = df[df['user'] == id]
    for idx, row in temp_df.iterrows():
        cur_date = row['current_date']
        prior_date = row['prior_date']
        # Sum points over this user's rows dated within [prior_date, cur_date].
        in_window = (temp_df['current_date'] <= cur_date) & (temp_df['current_date'] >= prior_date)
        temp_total = temp_df[in_window]['points_scored'].sum()
        if temp_total >= 8:
            # One qualifying window is enough to flag the user.
            flags[id] = 1
            break

The code above works, but it takes far too long to execute.
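One possible way to avoid the Python-level loops, offered as a sketch and not taken from the answer below, is to self-join the frame on user, keep the row pairs that fall inside each row's window, and sum per row. The helper name flag_users and the temporary column names are hypothetical. Note that the self-join is quadratic in the number of rows per user, so it trades memory for speed.

```python
import pandas as pd

def flag_users(df, threshold=8):
    """Return {user: flag}; flag is 1 if any row's window sum reaches threshold."""
    left = df.reset_index(drop=True).copy()
    left["row_id"] = range(len(left))  # stable identifier for each row

    right = df[["user", "current_date", "points_scored"]].rename(
        columns={"current_date": "other_date", "points_scored": "other_points"}
    )

    # Pair every row with every row of the same user.
    pairs = left.merge(right, on="user")

    # Keep pairs whose other_date lies in [prior_date, current_date].
    in_window = (pairs["other_date"] >= pairs["prior_date"]) & (
        pairs["other_date"] <= pairs["current_date"]
    )
    window_sum = pairs[in_window].groupby("row_id")["other_points"].sum()

    # Per-row flag, then collapse to one flag per user.
    left["flag"] = (left["row_id"].map(window_sum) >= threshold).astype(int)
    return left.groupby("user")["flag"].max().to_dict()

df = pd.DataFrame({
    "user": [1, 2, 2, 2],
    "current_date": pd.to_datetime(
        ["2021-01-01", "2021-01-01", "2021-01-21", "2021-05-01"]
    ),
    "prior_date": pd.to_datetime(
        ["2020-10-01", "2020-10-01", "2020-10-21", "2021-02-01"]
    ),
    "points_scored": [5, 4, 4, 4],
})

flags = flag_users(df)
print(flags)
```

Unlike the loop above, this sketch also returns a 0 for users who never reach the threshold, which matches the user/flag table in the question.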

You are right, looping over large data can be quite time-consuming. This is where the power of numpy comes into full play. I am still not sure what you want, but I can help with the speed: numpy.select can evaluate your if/else logic efficiently.

import pandas as pd
import numpy as np

# Conditions are checked in order; the first one that matches wins.
condition = [df['points_scored'] == 5, df['points_scored'] == 4, df['points_scored'] == 3]  # <-- put your conditions here
choices = ['okay', 'hmmm!', 'yes']  # <-- what you want returned (the order is important)
np.select(condition, choices, default='default value')
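To see what the snippet above returns, here is the same call run against the points_scored values from the question. The labels are the placeholder strings from the snippet, and the labels variable is only for illustration. np.select works element-wise on each row, so by itself it does not compute a sum over a date window.

```python
import numpy as np
import pandas as pd

# points_scored values from the question's sample data.
df = pd.DataFrame({"points_scored": [5, 4, 4, 4]})

condition = [
    df["points_scored"] == 5,
    df["points_scored"] == 4,
    df["points_scored"] == 3,
]
choices = ["okay", "hmmm!", "yes"]

# One label per row; rows matching no condition get the default.
labels = np.select(condition, choices, default="default value")
print(labels)
```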

Also, you might want to state what you want more succinctly. Meanwhile, you can refactor your loops with np.select().

