pandas: group and aggregate a DataFrame by a condition on 2 columns

Question

I have this sample DataFrame:

import pandas as pd

df = pd.DataFrame({'CUSTOM_CRITERIA': [1111, 22222, 1111, 1212, 1212, 3333, 5555, 1111],
                   'AD_UNIT_NAME': ['inp2_l_d', 'inp1', 'pixel_d', 'inp2_l_d', 'anchor_m', 'anchor_m', 'anchor_m', 'inp2_l_d'],
                   'TOTAL_CODE_SERVED_COUNT': [10, 20, 10, 12, 18, 500, 100, 50]})

For each CUSTOM_CRITERIA, I need the larger of two TOTAL_CODE_SERVED_COUNT totals: the total for anchor_m, or the combined total for inp2_l_d plus pixel_d.

My current solution looks like this:

data_dict = clean_data.to_dict(orient='records')

for item in data_dict:
    desktop_impression_max_calculated = sum([d['TOTAL_CODE_SERVED_COUNT'] for d in data_dict if d['CUSTOM_CRITERIA'] == item['CUSTOM_CRITERIA'] and ('inp2_l_d' in d['AD_UNIT_NAME'].lower() or 'pixel_d' in d['AD_UNIT_NAME'].lower())])
    mobile_impression_max_calculated = sum([d['TOTAL_CODE_SERVED_COUNT'] for d in data_dict if d['CUSTOM_CRITERIA'] == item['CUSTOM_CRITERIA'] and 'anchor_m' in d['AD_UNIT_NAME'].lower()])
    item['IMPRESSIONS_MAX'] = max(desktop_impression_max_calculated,mobile_impression_max_calculated)

clean_data = pd.DataFrame(data_dict)   
agg_map = {'IMPRESSIONS_MAX': 'first' }

clean_data = clean_data.groupby('CUSTOM_CRITERIA').agg(agg_map).reset_index()

This takes a long time to run on a large amount of data because of its O(N^2) complexity. I'm sure there is a better and simpler way to do it with pandas.

Answer 1

You can create two masked columns by multiplying the values in the TOTAL_CODE_SERVED_COUNT column by the boolean masks m1 and m2, then group these masked columns by CUSTOM_CRITERIA and aggregate with sum. Finally, take the max along axis=1 to get the final result:

m1 = df['AD_UNIT_NAME'].str.contains(r'(?i)inp2_l_d|pixel_d')
m2 = df['AD_UNIT_NAME'].str.contains(r'(?i)anchor_m')

pd.DataFrame((df['TOTAL_CODE_SERVED_COUNT'].values * [m1, m2]).T)\
  .groupby(df['CUSTOM_CRITERIA']).sum().max(1).reset_index(name='IMPRESSIONS_MAX')

   CUSTOM_CRITERIA  IMPRESSIONS_MAX
0             1111               70
1             1212               18
2             3333              500
3             5555              100
4            22222                0
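
For readers who find the array multiplication above a little dense, here is a more explicit sketch of the same idea. It is not the answerer's code: the helper column names DESKTOP and MOBILE are hypothetical, and it uses Series.where to zero out non-matching rows instead of multiplying by the masks. It assumes the sample df from the question.

```python
import pandas as pd

df = pd.DataFrame({
    'CUSTOM_CRITERIA': [1111, 22222, 1111, 1212, 1212, 3333, 5555, 1111],
    'AD_UNIT_NAME': ['inp2_l_d', 'inp1', 'pixel_d', 'inp2_l_d',
                     'anchor_m', 'anchor_m', 'anchor_m', 'inp2_l_d'],
    'TOTAL_CODE_SERVED_COUNT': [10, 20, 10, 12, 18, 500, 100, 50],
})

# Case-insensitive substring masks, equivalent to m1 and m2 above.
is_desktop = df['AD_UNIT_NAME'].str.contains('inp2_l_d|pixel_d', case=False)
is_mobile = df['AD_UNIT_NAME'].str.contains('anchor_m', case=False)

# Keep the count where the mask holds and use 0 elsewhere.
# DESKTOP and MOBILE are hypothetical helper column names.
totals = (
    df.assign(
        DESKTOP=df['TOTAL_CODE_SERVED_COUNT'].where(is_desktop, 0),
        MOBILE=df['TOTAL_CODE_SERVED_COUNT'].where(is_mobile, 0),
    )
    .groupby('CUSTOM_CRITERIA')[['DESKTOP', 'MOBILE']]
    .sum()
)

# Row-wise maximum of the two per-group totals.
result = totals.max(axis=1).reset_index(name='IMPRESSIONS_MAX')
print(result)
```

On the sample data this should give the same table as above. Both versions make a single pass over the rows plus one groupby, which avoids the nested scan in the question's loop.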

Answer 1: 4, accepted, 2021-01-03 08:35:18
