Pandas Groupby using different agg methods for different columns

Here is the scenario:

  • I have a large ordered dataset with 314 columns and over 300,000 rows for an ML problem.

  • I want to group the dataset by column X (suppliers).

  • One column is a datetime type, some columns are numeric by nature, and others were one-hot encoded from categorical columns.

Desired output:

  • I want to group by column X and aggregate the numeric columns with "mean", some columns with "last", and the one-hot-encoded ones with "sum", all in a single agg call.

Since the dataset has 314 columns, I can't just write out a dict that lists every column by hand.

df_train.groupby('Supplier').agg({<some columns> : 'last', <some columns>: 'sum', <some columns>: 'mean' })

PS: I ordered the columns in the sequence in which I want to apply the different aggregations.

You could use select_dtypes to get the columns that are numeric, and use these in a dictionary comprehension.

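# select_dtypes('number') matches every numeric dtype
numeric_cols = df_train.select_dtypes('number').columns

# Sum the numeric columns, take the last value of the rest, and skip the grouping column
agg_dict = {c: 'sum' if c in numeric_cols else 'last' for c in df_train.columns if c != 'Supplier'}

grouped = df_train.groupby('Supplier').agg(agg_dict)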

With regard to your one-hot encoded columns, you will need to provide more information on how they can be identified.
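If no naming convention is available, one possible heuristic is to treat any boolean column, or any numeric column whose values are all 0 or 1, as one-hot encoded. The sketch below is only an illustration under that assumption: the helper name build_agg_dict and the toy columns date, price, cat_x and cat_y are hypothetical, and a genuinely numeric column that happens to contain only 0s and 1s would be misclassified.

```python
import pandas as pd
from pandas.api.types import is_bool_dtype, is_numeric_dtype


def build_agg_dict(df, group_col):
    """Map every non-grouping column to 'sum', 'mean' or 'last'."""
    agg_dict = {}
    for col in df.columns:
        if col == group_col:
            continue  # the grouping key itself is not aggregated
        series = df[col]
        if is_bool_dtype(series):
            # Assumption: boolean columns come from one-hot encoding.
            agg_dict[col] = "sum"
        elif is_numeric_dtype(series):
            # Assumption: a numeric column holding only 0/1 is one-hot encoded.
            only_zero_one = bool(series.dropna().isin([0, 1]).all())
            agg_dict[col] = "sum" if only_zero_one else "mean"
        else:
            # Datetime, string and other non-numeric columns keep their last value.
            agg_dict[col] = "last"
    return agg_dict


# Toy data standing in for the real 314-column frame (hypothetical columns).
df_train = pd.DataFrame({
    "Supplier": ["A", "A", "B"],
    "date": pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03"]),
    "price": [10.0, 20.0, 30.0],
    "cat_x": [1, 0, 1],
    "cat_y": [0, 1, 0],
})

agg_dict = build_agg_dict(df_train, "Supplier")
grouped = df_train.groupby("Supplier").agg(agg_dict)
```

Because the columns are already ordered by the aggregation you want, positional slices of df_train.columns could replace the value-based check, provided you know where each block of columns starts and ends.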
