Fill In Missing Values With Groupby

I'm trying to fill in missing values after grouping by two columns of the planets dataset.

import seaborn as sns

# Load data
df = sns.load_dataset('planets')

# Count NaN values per column
df.isna().sum()

method              0
number              0
orbital_period     43
mass              522
distance          227
year                0
dtype: int64

However, after filling in the missing values with the group mean, some missing values still remain. I'm not sure why this is happening (I tried the same approach on the titanic dataset and it works completely there). Even if I fill each column separately (without the for loop), the problem still shows up.

# Select the names of columns that contain NaN
null_cols = df.columns[df.isnull().any()]

# Fill each column with its group mean
for col in null_cols:
    df[col] = df.groupby(['method', 'year'])[col].transform(lambda x: x.fillna(x.mean()))

# Count NaN values again
df.isna().sum()

method              0
number              0
orbital_period     28
mass              405
distance           26
year                0

What's wrong here? Any suggestions would be appreciated. Thanks!

This might do what you want. Use DataFrame.fillna with the column means as the fill values:

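import seaborn as sns

df = sns.load_dataset('planets')
# numeric_only=True skips the string column 'method', which has no mean.
# Note: this fills with the overall column means, not per-group means.
df = df.fillna(df.mean(numeric_only=True))
print(df.isna().sum())
print(df.head())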

The reason this is happening is that some of the groups you're generating contain no non-NaN values at all for a given column.

Take the mass column for the group ('Microlensing', 2012): it has 6 entries, none of which is non-NaN. If a group has no actual values, its mean is itself NaN, so there is nothing to impute the other NaN values in that group with.
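The effect can be reproduced on a tiny frame without the planets dataset. The sketch below uses hypothetical toy data (the names `toy`, `group` and `value` are made up for illustration): one group has observed values and gets filled, the other is entirely NaN and stays that way. It also shows one possible way to spot such groups, relying on `count()` ignoring NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: group "b" has no observed values at all.
toy = pd.DataFrame({
    "group": ["a", "a", "a", "b", "b"],
    "value": [1.0, np.nan, 3.0, np.nan, np.nan],
})

# Per-group mean, broadcast back to the original rows.
group_mean = toy.groupby("group")["value"].transform("mean")

# Group "a" is filled with its mean; group "b" has a NaN mean,
# so fillna has nothing to fill with and its rows stay NaN.
filled = toy["value"].fillna(group_mean)

# count() ignores NaN, so a count of 0 marks an all-NaN group.
non_nan_counts = toy.groupby("group")["value"].count()
empty_groups = non_nan_counts[non_nan_counts == 0].index.tolist()

print(filled.tolist())
print(empty_groups)
```

On the planets data the same check would presumably be run with `['method', 'year']` as the grouping keys and each of the three columns in turn.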

Here is the debug code I used:

import math
import seaborn as sns

df = sns.load_dataset("planets")

print(df.isna().sum())

null_cols = df.columns[df.isnull().any()]


def inspect_fillna(x):
    # x is the Series of one column for one (method, year) group
    mean_x = x.mean()
    # The mean is NaN only when the group has no non-NaN values
    if math.isnan(mean_x):
        print("group=", x.name, ", entries=", len(x), ", all_are_nan=", len(x) == x.isna().sum(), sep="")
    imputed_x = x.fillna(mean_x)
    return imputed_x


for col in null_cols:
    print("\n\ncol=", col, sep="")
    df[col] = df.groupby(["method", "year"])[col].transform(inspect_fillna)

print(df.isna().sum())

Here is the output:

method              0
number              0
orbital_period     43
mass              522
distance          227
year                0
dtype: int64


col=orbital_period
group=('Imaging', 2004), entries=3, all_are_nan=True
group=('Imaging', 2005), entries=1, all_are_nan=True
group=('Imaging', 2007), entries=1, all_are_nan=True
group=('Imaging', 2012), entries=2, all_are_nan=True
group=('Imaging', 2013), entries=7, all_are_nan=True
group=('Microlensing', 2004), entries=1, all_are_nan=True
group=('Microlensing', 2009), entries=2, all_are_nan=True
group=('Microlensing', 2012), entries=6, all_are_nan=True
group=('Microlensing', 2013), entries=4, all_are_nan=True
group=('Transit Timing Variations', 2014), entries=1, all_are_nan=True


col=mass
group=('Astrometry', 2010), entries=1, all_are_nan=True
group=('Astrometry', 2013), entries=1, all_are_nan=True
group=('Eclipse Timing Variations', 2008), entries=2, all_are_nan=True
group=('Eclipse Timing Variations', 2010), entries=2, all_are_nan=True
group=('Eclipse Timing Variations', 2011), entries=3, all_are_nan=True
group=('Imaging', 2004), entries=3, all_are_nan=True
group=('Imaging', 2005), entries=1, all_are_nan=True
group=('Imaging', 2006), entries=4, all_are_nan=True
group=('Imaging', 2007), entries=1, all_are_nan=True
group=('Imaging', 2008), entries=8, all_are_nan=True
group=('Imaging', 2009), entries=3, all_are_nan=True
group=('Imaging', 2010), entries=6, all_are_nan=True
group=('Imaging', 2011), entries=3, all_are_nan=True
group=('Imaging', 2012), entries=2, all_are_nan=True
group=('Imaging', 2013), entries=7, all_are_nan=True
group=('Microlensing', 2004), entries=1, all_are_nan=True
group=('Microlensing', 2005), entries=2, all_are_nan=True
group=('Microlensing', 2006), entries=1, all_are_nan=True
group=('Microlensing', 2008), entries=4, all_are_nan=True
group=('Microlensing', 2009), entries=2, all_are_nan=True
group=('Microlensing', 2010), entries=2, all_are_nan=True
group=('Microlensing', 2011), entries=1, all_are_nan=True
group=('Microlensing', 2012), entries=6, all_are_nan=True
group=('Microlensing', 2013), entries=4, all_are_nan=True
group=('Orbital Brightness Modulation', 2011), entries=2, all_are_nan=True
group=('Orbital Brightness Modulation', 2013), entries=1, all_are_nan=True
group=('Pulsar Timing', 1992), entries=2, all_are_nan=True
group=('Pulsar Timing', 1994), entries=1, all_are_nan=True
group=('Pulsar Timing', 2003), entries=1, all_are_nan=True
group=('Pulsar Timing', 2011), entries=1, all_are_nan=True
group=('Pulsation Timing Variations', 2007), entries=1, all_are_nan=True
group=('Transit', 2002), entries=1, all_are_nan=True
group=('Transit', 2004), entries=5, all_are_nan=True
group=('Transit', 2006), entries=5, all_are_nan=True
group=('Transit', 2007), entries=16, all_are_nan=True
group=('Transit', 2008), entries=17, all_are_nan=True
group=('Transit', 2009), entries=18, all_are_nan=True
group=('Transit', 2010), entries=48, all_are_nan=True
group=('Transit', 2011), entries=80, all_are_nan=True
group=('Transit', 2012), entries=92, all_are_nan=True
group=('Transit', 2014), entries=40, all_are_nan=True
group=('Transit Timing Variations', 2011), entries=1, all_are_nan=True
group=('Transit Timing Variations', 2012), entries=1, all_are_nan=True
group=('Transit Timing Variations', 2013), entries=1, all_are_nan=True
group=('Transit Timing Variations', 2014), entries=1, all_are_nan=True


col=distance
group=('Eclipse Timing Variations', 2009), entries=1, all_are_nan=True
group=('Eclipse Timing Variations', 2011), entries=3, all_are_nan=True
group=('Eclipse Timing Variations', 2012), entries=1, all_are_nan=True
group=('Microlensing', 2004), entries=1, all_are_nan=True
group=('Microlensing', 2005), entries=2, all_are_nan=True
group=('Microlensing', 2006), entries=1, all_are_nan=True
group=('Microlensing', 2008), entries=4, all_are_nan=True
group=('Microlensing', 2009), entries=2, all_are_nan=True
group=('Microlensing', 2010), entries=2, all_are_nan=True
group=('Microlensing', 2011), entries=1, all_are_nan=True
group=('Orbital Brightness Modulation', 2013), entries=1, all_are_nan=True
group=('Pulsar Timing', 1992), entries=2, all_are_nan=True
group=('Pulsar Timing', 1994), entries=1, all_are_nan=True
group=('Pulsar Timing', 2003), entries=1, all_are_nan=True
group=('Pulsation Timing Variations', 2007), entries=1, all_are_nan=True
group=('Transit', 2002), entries=1, all_are_nan=True
group=('Transit Timing Variations', 2014), entries=1, all_are_nan=True
method              0
number              0
orbital_period     28
mass              405
distance           26
year                0
dtype: int64

Possible solution: consider making your groups larger by removing year or method from the grouping keys.
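One way to act on this suggestion is to fall back to coarser groups only where the finer group has no mean. The following is a minimal sketch on hypothetical toy data, not something from the original answer: `fill_with_fallback` is an illustrative helper name, and the final fallback to the overall mean is an assumption about what you want when even the coarsest group is empty.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data shaped like the planets columns used above.
toy = pd.DataFrame({
    "method": ["Transit", "Transit", "Transit", "Imaging", "Imaging", "Astrometry"],
    "year": [2010, 2010, 2011, 2010, 2011, 2012],
    "mass": [2.0, np.nan, np.nan, 6.0, np.nan, np.nan],
})


def fill_with_fallback(df, col, group_levels):
    """Fill NaNs in df[col] using progressively coarser group means."""
    filled = df[col].copy()
    for keys in group_levels:
        # Means are always computed from the original observed values.
        filled = filled.fillna(df.groupby(keys)[col].transform("mean"))
    # Last resort (an assumption): the overall mean of the observed values.
    return filled.fillna(df[col].mean())


toy["mass_filled"] = fill_with_fallback(
    toy, "mass", [["method", "year"], ["method"]]
)
print(toy)
```

Here the second Transit row is filled from its (method, year) group, the third Transit row and the second Imaging row fall back to their method mean, and the Astrometry row, which has no observed mass for its method at all, falls back to the overall mean. Whether such a fallback is statistically appropriate depends on the analysis.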

Try this:

for col in df.columns[df.isnull().any()]:
    df[col] = df[col].fillna(df.groupby(['method', 'year'])[col].transform('mean'))
