
Pandas: Groupby each column in a different way

Let's say that I have the following data-frame:

import numpy as np
import pandas as pd

df = pd.DataFrame({"unique_id": [1, 1, 1],
                   "att1_amr": [11, 11, 11],
                   "att2_nominal": [1, np.nan, np.nan],
                   "att3_nominal": [np.nan, 1, np.nan],
                   "att4_bok": [33.33, 33.33, 33.33],
                   "att5_nominal": [np.nan, np.nan, np.nan],
                   "att6_zpq": [22.22, 22.22, 22.22]})

What I want to do is group the rows of the data-frame by unique_id in such a way that I can apply one aggregation to the columns whose names contain the word nominal and a different aggregation to all the others. To be more specific, I want to aggregate the columns that contain nominal with sum(min_count=1) and the others with first() or last(). The result should be the following:

df_result = pd.DataFrame({"unique_id": [1],
                          "att1_amr": [11],
                          "att2_nominal": [1],
                          "att3_nominal": [1],
                          "att4_bok": [33.33],
                          "att5_nominal": [np.nan],
                          "att6_zpq": [22.22]})

Thank you!

You can create the dictionary dynamically: first map every column containing nominal to the lambda function, then map all remaining columns to 'last', merge the two dicts, and finally call DataFrameGroupBy.agg:

d1 = dict.fromkeys(df.columns[df.columns.str.contains('nominal')], 
                   lambda x : x.sum(min_count=1))

d2 = dict.fromkeys(df.columns.difference(['unique_id'] + list(d1)), 'last')
d = {**d1, **d2}

df = df.groupby('unique_id').agg(d)
print (df)
           att2_nominal  att3_nominal  att5_nominal  att1_amr  att4_bok  \
unique_id                                                                 
1                   1.0           1.0           NaN        11     33.33   

           att6_zpq  
unique_id            
1             22.22  
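Note that agg returns the columns in dict order, nominal columns first, as the output above shows. A self-contained sketch of the approach (my addition), with a reindex at the end to restore the question's original column order:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"unique_id": [1, 1, 1],
                   "att1_amr": [11, 11, 11],
                   "att2_nominal": [1, np.nan, np.nan],
                   "att3_nominal": [np.nan, 1, np.nan],
                   "att4_bok": [33.33, 33.33, 33.33],
                   "att5_nominal": [np.nan, np.nan, np.nan],
                   "att6_zpq": [22.22, 22.22, 22.22]})

# sum(min_count=1) yields NaN (instead of 0) when a group is all-NaN,
# which is what keeps att5_nominal as NaN in the result
d1 = dict.fromkeys(df.columns[df.columns.str.contains('nominal')],
                   lambda x: x.sum(min_count=1))
d2 = dict.fromkeys(df.columns.difference(['unique_id'] + list(d1)), 'last')

out = df.groupby('unique_id').agg({**d1, **d2})
# the aggregated frame follows dict order; reindex restores the input layout
out = out.reindex(columns=[c for c in df.columns if c != 'unique_id'])
```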

Another, cleaner solution:

d = {k: (lambda x : x.sum(min_count=1)) 
     if 'nominal' in k 
     else 'last' 
     for k in df.columns.difference(['unique_id'])}

df = df.groupby('unique_id').agg(d)
print (df)
           att1_amr  att2_nominal  att3_nominal  att4_bok  att5_nominal  \
unique_id                                                                 
1                11           1.0           1.0     33.33           NaN   

           att6_zpq  
unique_id            
1             22.22  

Why not just:

>>> df.ffill().bfill().drop_duplicates()
   att1_amr  att2_nominal  att3_nominal  att4_bok  att5_nominal  att6_zpq  \
0        11           1.0           1.0     33.33           NaN     22.22   

   unique_id  
0          1  
>>> 
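One caveat worth spelling out (my addition, not from the answer): this chain only produces the right result because every row in the example shares the same unique_id. With more than one id, ffill/bfill crosses group boundaries and leaks values between ids, e.g.:

```python
import numpy as np
import pandas as pd

# two groups; unique_id 2 has no non-NaN value of its own
df = pd.DataFrame({"unique_id": [1, 1, 2, 2],
                   "att2_nominal": [1.0, np.nan, np.nan, np.nan]})

# forward/backward fill ignores the group boundary, so the 1.0 from
# unique_id 1 is propagated into the rows of unique_id 2
leaked = df.ffill().bfill().drop_duplicates()
```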

The solution provided by @jezrael works just fine and is the most elegant one; however, I ran into severe performance issues with it. Surprisingly, I found the following to be a much faster solution that achieves the same goal.

nominal_cols = df.filter(like="nominal").columns.values
other_cols = [col for col in df.columns.values if col not in nominal_cols and col != "unique_id"]
df1 = df.groupby('unique_id', as_index=False)[nominal_cols].sum(min_count=1)
df2 = df.groupby('unique_id', as_index=False)[other_cols].first()
df_result = pd.merge(df1, df2, on=["unique_id"], how="inner")
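The speed claim is plausible because groupby's built-in sum and first are vectorized, while a lambda in agg is called once per group in Python. A sketch (my addition; the synthetic sizes and column names are arbitrary) that checks the two variants agree and times both:

```python
import time
import numpy as np
import pandas as pd

# synthetic data shaped like the question: many small groups
n = 4_000
rng = np.random.default_rng(0)
df = pd.DataFrame({"unique_id": np.repeat(np.arange(n // 2), 2),
                   "att1_amr": 11,
                   "att2_nominal": rng.choice([1.0, np.nan], n),
                   "att6_zpq": 22.22})

nominal_cols = ["att2_nominal"]
other_cols = ["att1_amr", "att6_zpq"]

# variant 1: per-column Python lambda, invoked once per group
d = {**{c: (lambda x: x.sum(min_count=1)) for c in nominal_cols},
     **{c: "first" for c in other_cols}}
t0 = time.perf_counter()
slow = df.groupby("unique_id").agg(d).reset_index()
t_lambda = time.perf_counter() - t0

# variant 2: vectorized built-in aggregations plus a merge
t0 = time.perf_counter()
g = df.groupby("unique_id", as_index=False)
fast = g[nominal_cols].sum(min_count=1).merge(g[other_cols].first(),
                                              on="unique_id")
t_builtin = time.perf_counter() - t0

# same result either way
pd.testing.assert_frame_equal(slow[fast.columns], fast)
print(f"lambda agg: {t_lambda:.3f}s  built-in + merge: {t_builtin:.3f}s")
```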
