Pandas Groupby: using different agg methods for different columns

Question

Here is the scenario:

I have a large ordered dataset with 314 columns and over 300,000 rows for an ML problem.
I want to group the dataset by column X (suppliers).
One column is a datetime type, some columns are numeric by nature, and others were one-hot encoded from categorical columns.

Desired output:

I want to group by column X and aggregate the numeric columns by "mean", some columns by "last", and the one-hot-encoded ones by "sum", all in the same agg call.

Since we are talking about a 314-column dataset, I can't just write out a dict listing each column by hand.

df_train.groupby('Supplier').agg({<some columns> : 'last', <some columns>: 'sum', <some columns>: 'mean' })

PS: I ordered the columns in the sequence in which I want to apply the different aggregations.

Answer 1

You could use select_dtypes to get the columns that are numeric, and use these in a dictionary comprehension.

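# 'number' is the select_dtypes selector that matches all numeric columns
numeric_cols = df_train.select_dtypes(include='number').columns

# numeric columns get 'sum' (swap in 'mean' where needed), the rest get 'last'; the grouping key is skipped
agg_dict = {c: 'sum' if c in numeric_cols else 'last' for c in df_train.columns if c != 'Supplier'}

grouped = df_train.groupby('Supplier').agg(agg_dict)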

With regard to your one-hot encoded columns, you will need to provide more information on how they can be identified.
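One possible way to tell the one-hot columns apart, sketched here under the assumption that they contain only the values 0 and 1 (or booleans), is to inspect the values of each numeric column. The helper name build_agg_dict and the toy frame standing in for df_train are hypothetical and only serve the illustration.

```python
import pandas as pd
from pandas.api.types import is_bool_dtype, is_numeric_dtype


def build_agg_dict(df, key):
    """Pick an aggregation name per column (hypothetical helper).

    Assumption: a numeric column holding only 0/1 values is one-hot encoded.
    """
    agg_dict = {}
    for col in df.columns:
        if col == key:
            continue  # the grouping column becomes the index
        series = df[col]
        looks_one_hot = is_bool_dtype(series) or (
            is_numeric_dtype(series) and series.dropna().isin([0, 1]).all()
        )
        if looks_one_hot:
            agg_dict[col] = "sum"   # count occurrences of the category
        elif is_numeric_dtype(series):
            agg_dict[col] = "mean"  # genuinely numeric columns
        else:
            agg_dict[col] = "last"  # datetimes, strings, and so on
    return agg_dict


# Toy stand-in for df_train
df_train = pd.DataFrame({
    "Supplier": ["A", "A", "B"],
    "date": pd.to_datetime(["2019-01-01", "2019-02-01", "2019-03-01"]),
    "price": [10.0, 20.0, 5.0],
    "cat_x": [1, 0, 1],
    "cat_y": [0, 1, 0],
})

agg_dict = build_agg_dict(df_train, "Supplier")
grouped = df_train.groupby("Supplier").agg(agg_dict)
print(grouped)
```

This heuristic would misclassify a genuine numeric column that happens to contain only 0 and 1. If the one-hot columns share a naming pattern, or if their positions are known because the columns were ordered beforehand, selecting them by name or position is likely to be more reliable.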

Answer 1
0 2019-05-03 14:55:59
