How to fill NaN based on a groupby transform without losing the grouped-by column?

Question

I have a dataset containing heights, weights, etc., and I intend to fill the NaN values with the mean value for that gender.

Example dataset:

    gender    height    weight
1     M          5       NaN
2     F          4       NaN
3     F         NaN        40
4     M         NaN        50

df = df.groupby("gender").transform(lambda x: x.fillna(x.mean()))

Current output:

     height    weight
1       5        50
2       4        40
3       4        40
4       5        50

Expected output:

    gender    height    weight
1     M          5        50
2     F          4        40
3     F          4        40
4     M          5        50

Unfortunately, this drops the gender column, which is important later on.

Answer 1

How about looping through the two columns you want to fill and calling GroupBy.transform on each one, grouping by 'gender':

for col in ['height','weight']:
    df[col] = df.groupby('gender')[col].transform(lambda x: x.fillna(x.mean()))

print(df)

  gender  height  weight
0      M     5.0    50.0
1      F     4.0    40.0
2      F     4.0    40.0
3      M     5.0    50.0
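
The same result can be reached without the loop or the lambda. The following is a minimal sketch, not part of the original answer: it rebuilds the example data from the question (the index 1..4 and the variable names `cols` and `group_means` are assumptions for illustration) and fills from a frame of per-group means.

```python
import numpy as np
import pandas as pd

# Example data from the question (index 1..4 as shown there).
df = pd.DataFrame(
    {
        "gender": ["M", "F", "F", "M"],
        "height": [5, 4, np.nan, np.nan],
        "weight": [np.nan, np.nan, 40, 50],
    },
    index=[1, 2, 3, 4],
)

cols = ["height", "weight"]

# transform("mean") returns the per-group mean for every row, aligned to
# df's index, so fillna can take it directly. "gender" is never touched.
group_means = df.groupby("gender")[cols].transform("mean")
df[cols] = df[cols].fillna(group_means)

print(df)
```

Because only the selected columns are assigned back, the grouping column stays in the frame.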

If you want to fill all the numerical columns, you can collect them in a list and apply the same approach:

# numeric (non-object) columns that contain at least one NaN
features_to_impute = [
    x for x in df.columns
    if df[x].dtype != 'O' and df[x].isnull().any()
]

for col in features_to_impute:
    df[col] = df.groupby('gender')[col].transform(lambda x: x.fillna(x.mean()))
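
As a variation on the list above, the numeric columns could also be picked with `select_dtypes`. This is a sketch under assumptions: the `age` column is hypothetical and is added only to show that columns without missing values are skipped.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "gender": ["M", "F", "F", "M"],
        "height": [5, 4, np.nan, np.nan],
        "weight": [np.nan, np.nan, 40, 50],
        "age": [30, 25, 41, 38],  # hypothetical column, no missing values
    }
)

# Keep only numeric columns that actually contain a NaN.
numeric_cols = df.select_dtypes(include="number").columns
features_to_impute = [c for c in numeric_cols if df[c].isna().any()]

# Fill all selected columns in one call; each column is filled with the
# mean of its own gender group.
df[features_to_impute] = df.groupby("gender")[features_to_impute].transform(
    lambda s: s.fillna(s.mean())
)

print(features_to_impute)
print(df)
```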

Answer 2

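Instead of using transform, you can use GroupBy.apply, which passes each whole group (a DataFrame) to the function. Pass group_keys=False so the result is not given an extra gender index level, and numeric_only=True so that mean() skips the gender column. Note that pandas 2.2 deprecates letting apply operate on the grouping column, so this form may emit a warning or behave differently on newer versions:

 df = df.groupby('gender', group_keys=False).apply(lambda x: x.fillna(x.mean(numeric_only=True)))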
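
Another option, shown here as a minimal sketch rather than as part of this answer, is to keep the transform call from the question and write its result back into the original frame instead of overwriting `df`. It assumes that gender is the only non-numeric column; the name `filled` is introduced for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "gender": ["M", "F", "F", "M"],
        "height": [5, 4, np.nan, np.nan],
        "weight": [np.nan, np.nan, 40, 50],
    }
)

# transform returns every column except the grouping column,
# which is why "gender" is missing from the question's output.
filled = df.groupby("gender").transform(lambda x: x.fillna(x.mean()))

# Assign the filled columns back by name; "gender" stays in df.
df[filled.columns] = filled

print(df)
```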


Answer 1
1 accepted 2021-02-08 16:05:53

Answer 2
0 2021-02-08 15:57:43

Answer 3
0 2021-02-08 16:30:23
