
Pandas Groupby Conditional Aggregation

Let's say you have a dataframe as follows:

import pandas as pd

data = pd.DataFrame({'Year': [2019]*5+[2020]*5,
          'Month': [1,1,2,2,3]*2,
          'Hour': [0,1,2,3,4]*2,
          'Value': [0.2,0.3,0.2,0.1,0.4,0.3,0.2,0.5,0.1,0.2]})

Then, define "low" times as hours 1 through 3 (inclusive) and "high" times as all other hours (in this case, hours 0 and 4). What I would like to do is get the average Value for the "low" and "high" times for each Year and Month. Ideally, these would be added as new columns to the grouped dataframe (i.e., the final dataframe would have Year, Month, Low, and High columns).

For loops work, but they're not ideal. I could also create a dummy variable (for instance, 0s and 1s) to mark the "low" and "high" times in the dataframe and group by it. However, it seems to me that there should be some way to use Pandas groupby(['Year', 'Month']).agg(...) to get the result efficiently. I haven't had any luck so far with groupby + agg, mainly because agg() passes each function a single column (a Series) rather than the whole group, so a function that averages Value cannot condition on Hour.

Expected result from sample data:

   Year  Month  High   Low
0  2019      1   0.2  0.30
1  2019      2   NaN  0.15
2  2019      3   0.4   NaN
3  2020      1   0.3  0.20
4  2020      2   NaN  0.30
5  2020      3   0.2   NaN

Any help is appreciated :)

Consider pivot_table after creating a low/high type indicator field:

import numpy as np

# between() is inclusive on both ends, so hours 1, 2 and 3 are "Low"
data['Type'] = np.where(data['Hour'].between(1,3), 'Low', 'High')

pvt_df = (pd.pivot_table(data, index=['Year', 'Month'],
                         columns='Type', values='Value', aggfunc='mean')
            .reset_index()
            .rename_axis(None, axis='columns')
         )

print(pvt_df)
#    Year  Month  High   Low
# 0  2019      1   0.2  0.30
# 1  2019      2   NaN  0.15
# 2  2019      3   0.4   NaN
# 3  2020      1   0.3  0.20
# 4  2020      2   NaN  0.30
# 5  2020      3   0.2   NaN
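
For readers who would rather stay with groupby, as the question asks, here is a minimal sketch of the same idea: group on Year, Month, and a low/high label, then unstack the label into columns. The names hour_type and result are illustrative only and do not come from the answer above.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({'Year': [2019]*5 + [2020]*5,
                     'Month': [1, 1, 2, 2, 3]*2,
                     'Hour': [0, 1, 2, 3, 4]*2,
                     'Value': [0.2, 0.3, 0.2, 0.1, 0.4, 0.3, 0.2, 0.5, 0.1, 0.2]})

# Label every row without adding a column to data (hour_type is a hypothetical name).
# between() is inclusive on both ends, so hours 1, 2 and 3 become 'Low'.
hour_type = pd.Series(np.where(data['Hour'].between(1, 3), 'Low', 'High'),
                      index=data.index, name='Type')

# Group by the two key columns plus the label, then move the label into columns.
result = (data.groupby(['Year', 'Month', hour_type])['Value']
              .mean()
              .unstack('Type')
              .reset_index()
              .rename_axis(None, axis='columns'))

print(result)
```

Year/Month combinations that have no low (or no high) hours come out as NaN in the corresponding column, matching the expected result in the question.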

This might not win the prize for the most beautiful piece of code, but if I understand you correctly, this is what you want.

(Correct me if I'm wrong, since there's no expected output included.)

Group four times (low and high hours, per year and per year and month), then join each low/high pair on its grouping keys. After that, do a final merge on Year to get all the columns together.

low_hours = [1, 2, 3]
is_low = data.Hour.isin(low_hours)

# Yearly means for low and high hours
groupby1 = data[is_low].groupby('Year').Value.mean().reset_index().rename({'Value':'Value_year_low'},axis=1)
groupby2 = data[~is_low].groupby('Year').Value.mean().reset_index().rename({'Value':'Value_year_high'},axis=1)
# Monthly means for low and high hours
groupby3 = data[is_low].groupby(['Year','Month']).Value.mean().reset_index().rename({'Value':'Value_month_low'},axis=1)
groupby4 = data[~is_low].groupby(['Year','Month']).Value.mean().reset_index().rename({'Value':'Value_month_high'},axis=1)

# Join on the keys rather than by row position: some months have only low
# or only high hours, so a positional concat would pair up the wrong rows
df_final1 = pd.merge(groupby1, groupby2, on='Year', how='outer')
df_final2 = pd.merge(groupby3, groupby4, on=['Year','Month'], how='outer').sort_values(['Year','Month'])

df_final = pd.merge(df_final1, df_final2, on='Year')
print(df_final)
   Year  Value_year_low  Value_year_high  Month  Value_month_low  Value_month_high
0  2019        0.200000             0.30      1             0.30               0.2
1  2019        0.200000             0.30      2             0.15               NaN
2  2019        0.200000             0.30      3              NaN               0.4
3  2020        0.266667             0.25      1             0.20               0.3
4  2020        0.266667             0.25      2             0.30               NaN
5  2020        0.266667             0.25      3              NaN               0.2

data = pd.DataFrame({'Year': [2019]*5+[2020]*5,
          'Month': [1,1,2,2,3]*2,
          'Hour': [0,1,2,3,4]*2,
          'Value': [0.2,0.3,0.2,0.1,0.4,0.3,0.2,0.5,0.1,0.2]})

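# Flag the "low" hours (1 to 3 inclusive)
data['low'] = (data['Hour'] > 0) & (data['Hour'] < 4)

# Average Value per Year and Month, separately for low and high hours
data[data['low']].groupby(['Year', 'Month'])['Value'].mean()
data[~data['low']].groupby(['Year', 'Month'])['Value'].mean()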
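
A boolean mask like the one above can also produce both averages in a single groupby. The sketch below is one possible way to do it, not a method taken from any of the answers: it copies Value into Low and High columns, blanks out the rows that do not belong with where(), and relies on mean() skipping NaN. The names is_low and result are illustrative.

```python
import pandas as pd

data = pd.DataFrame({'Year': [2019]*5 + [2020]*5,
                     'Month': [1, 1, 2, 2, 3]*2,
                     'Hour': [0, 1, 2, 3, 4]*2,
                     'Value': [0.2, 0.3, 0.2, 0.1, 0.4, 0.3, 0.2, 0.5, 0.1, 0.2]})

# True for hours 1, 2 and 3 (between() is inclusive on both ends)
is_low = data['Hour'].between(1, 3)

# where() keeps Value where the condition holds and puts NaN elsewhere;
# mean() ignores NaN, so each column averages only its own rows.
result = (data.assign(Low=data['Value'].where(is_low),
                      High=data['Value'].where(~is_low))
              .groupby(['Year', 'Month'], as_index=False)[['High', 'Low']]
              .mean())

print(result)
```

A group whose rows are all NaN in one of the columns yields NaN there, so months with only low or only high hours are kept in the result instead of being dropped.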
