Fill In Missing Values With Groupby
I'm trying to fill in missing values after grouping by two columns on the planets dataset.
import seaborn as sns
# Load data
df = sns.load_dataset('planets')
# Count NaN values per column
df.isna().sum()
method 0
number 0
orbital_period 43
mass 522
distance 227
year 0
dtype: int64
However, after filling in the missing values with the group mean, some missing values still remain. I'm not sure why this is happening (I tried this on the titanic dataset and it works completely there). Even if I fill each column separately (no for loop), the problem still shows up.
# Select the names of columns that contain NaN
null_cols = df.columns[df.isnull().any()]
# Fill each column with its (method, year) group mean
for col in null_cols:
    df[col] = df.groupby(['method', 'year'])[col].transform(lambda x: x.fillna(x.mean()))
# Count NaN values again
df.isna().sum()
method 0
number 0
orbital_period 28
mass 405
distance 26
year 0
What's wrong here? Any suggestions would be appreciated. Thanks!
This might do what you want. Use DataFrame.fillna, with the mean as the fill value:
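import seaborn as sns
df = sns.load_dataset('planets')
# Note: df.mean() is the overall column mean, not a per-group mean;
# numeric_only=True skips the string column 'method'
df = df.groupby(['method', 'year']).fillna(df.mean(numeric_only=True))
df.isna().sum()
df.head()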
This happens because some of the groups you generate contain no non-NaN value at all for the column being filled.
Take the column mass for the group ('Microlensing', 2012): it has 6 entries, of which 0 are non-NaN. With no actual values to average, the group mean is itself NaN, so there is nothing to impute the other NaN values in that group with.
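A quick way to list such groups without running the fill is to count the non-NaN values per group. The sketch below uses a small made-up frame (`toy`, with invented values) as a stand-in for the planets data, so the names and numbers are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the planets data; the values are made up for illustration
toy = pd.DataFrame({
    "method": ["Transit", "Transit", "Microlensing", "Microlensing"],
    "year": [2010, 2010, 2012, 2012],
    "mass": [1.0, np.nan, np.nan, np.nan],
})

# count() ignores NaN, so a count of 0 marks a group with nothing to average
non_nan = toy.groupby(["method", "year"])["mass"].count()
empty_groups = non_nan[non_nan == 0].index.tolist()
print(empty_groups)
```

Any group listed in `empty_groups` will keep its NaN values after a group-mean fill on that column.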
Here is the debug code I used:
import math
import seaborn as sns

df = sns.load_dataset("planets")
print(df.isna().sum())
null_cols = df.columns[df.isnull().any()]

def inspect_fillna(x):
    mean_x = x.mean()
    # A NaN mean means the group has no non-NaN value to average
    if math.isnan(mean_x):
        print("group=", x.name, ", entries=", len(x), ", all_are_nan=", len(x) == x.isna().sum(), sep="")
    imputed_x = x.fillna(mean_x)
    return imputed_x

for col in null_cols:
    print("\n\ncol=", col, sep="")
    df[col] = df.groupby(["method", "year"])[col].transform(lambda x: inspect_fillna(x))
print(df.isna().sum())
Here is the output:
method 0
number 0
orbital_period 43
mass 522
distance 227
year 0
dtype: int64
col=orbital_period
group=('Imaging', 2004), entries=3, all_are_nan=True
group=('Imaging', 2005), entries=1, all_are_nan=True
group=('Imaging', 2007), entries=1, all_are_nan=True
group=('Imaging', 2012), entries=2, all_are_nan=True
group=('Imaging', 2013), entries=7, all_are_nan=True
group=('Microlensing', 2004), entries=1, all_are_nan=True
group=('Microlensing', 2009), entries=2, all_are_nan=True
group=('Microlensing', 2012), entries=6, all_are_nan=True
group=('Microlensing', 2013), entries=4, all_are_nan=True
group=('Transit Timing Variations', 2014), entries=1, all_are_nan=True
col=mass
group=('Astrometry', 2010), entries=1, all_are_nan=True
group=('Astrometry', 2013), entries=1, all_are_nan=True
group=('Eclipse Timing Variations', 2008), entries=2, all_are_nan=True
group=('Eclipse Timing Variations', 2010), entries=2, all_are_nan=True
group=('Eclipse Timing Variations', 2011), entries=3, all_are_nan=True
group=('Imaging', 2004), entries=3, all_are_nan=True
group=('Imaging', 2005), entries=1, all_are_nan=True
group=('Imaging', 2006), entries=4, all_are_nan=True
group=('Imaging', 2007), entries=1, all_are_nan=True
group=('Imaging', 2008), entries=8, all_are_nan=True
group=('Imaging', 2009), entries=3, all_are_nan=True
group=('Imaging', 2010), entries=6, all_are_nan=True
group=('Imaging', 2011), entries=3, all_are_nan=True
group=('Imaging', 2012), entries=2, all_are_nan=True
group=('Imaging', 2013), entries=7, all_are_nan=True
group=('Microlensing', 2004), entries=1, all_are_nan=True
group=('Microlensing', 2005), entries=2, all_are_nan=True
group=('Microlensing', 2006), entries=1, all_are_nan=True
group=('Microlensing', 2008), entries=4, all_are_nan=True
group=('Microlensing', 2009), entries=2, all_are_nan=True
group=('Microlensing', 2010), entries=2, all_are_nan=True
group=('Microlensing', 2011), entries=1, all_are_nan=True
group=('Microlensing', 2012), entries=6, all_are_nan=True
group=('Microlensing', 2013), entries=4, all_are_nan=True
group=('Orbital Brightness Modulation', 2011), entries=2, all_are_nan=True
group=('Orbital Brightness Modulation', 2013), entries=1, all_are_nan=True
group=('Pulsar Timing', 1992), entries=2, all_are_nan=True
group=('Pulsar Timing', 1994), entries=1, all_are_nan=True
group=('Pulsar Timing', 2003), entries=1, all_are_nan=True
group=('Pulsar Timing', 2011), entries=1, all_are_nan=True
group=('Pulsation Timing Variations', 2007), entries=1, all_are_nan=True
group=('Transit', 2002), entries=1, all_are_nan=True
group=('Transit', 2004), entries=5, all_are_nan=True
group=('Transit', 2006), entries=5, all_are_nan=True
group=('Transit', 2007), entries=16, all_are_nan=True
group=('Transit', 2008), entries=17, all_are_nan=True
group=('Transit', 2009), entries=18, all_are_nan=True
group=('Transit', 2010), entries=48, all_are_nan=True
group=('Transit', 2011), entries=80, all_are_nan=True
group=('Transit', 2012), entries=92, all_are_nan=True
group=('Transit', 2014), entries=40, all_are_nan=True
group=('Transit Timing Variations', 2011), entries=1, all_are_nan=True
group=('Transit Timing Variations', 2012), entries=1, all_are_nan=True
group=('Transit Timing Variations', 2013), entries=1, all_are_nan=True
group=('Transit Timing Variations', 2014), entries=1, all_are_nan=True
col=distance
group=('Eclipse Timing Variations', 2009), entries=1, all_are_nan=True
group=('Eclipse Timing Variations', 2011), entries=3, all_are_nan=True
group=('Eclipse Timing Variations', 2012), entries=1, all_are_nan=True
group=('Microlensing', 2004), entries=1, all_are_nan=True
group=('Microlensing', 2005), entries=2, all_are_nan=True
group=('Microlensing', 2006), entries=1, all_are_nan=True
group=('Microlensing', 2008), entries=4, all_are_nan=True
group=('Microlensing', 2009), entries=2, all_are_nan=True
group=('Microlensing', 2010), entries=2, all_are_nan=True
group=('Microlensing', 2011), entries=1, all_are_nan=True
group=('Orbital Brightness Modulation', 2013), entries=1, all_are_nan=True
group=('Pulsar Timing', 1992), entries=2, all_are_nan=True
group=('Pulsar Timing', 1994), entries=1, all_are_nan=True
group=('Pulsar Timing', 2003), entries=1, all_are_nan=True
group=('Pulsation Timing Variations', 2007), entries=1, all_are_nan=True
group=('Transit', 2002), entries=1, all_are_nan=True
group=('Transit Timing Variations', 2014), entries=1, all_are_nan=True
method 0
number 0
orbital_period 28
mass 405
distance 26
year 0
dtype: int64
Possible solution: consider making your groups larger by removing year or method from the grouping keys.
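One way to apply that idea without giving up the finer groups is to fall back step by step: fill with the (method, year) mean first, then with the method mean, and finally with the overall column mean. This is only a sketch under assumptions: `fill_with_fallback` is a hypothetical helper, not a pandas function, and `toy` is a made-up frame standing in for the planets data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the planets data; the values are made up for illustration
toy = pd.DataFrame({
    "method": ["Transit", "Transit", "Transit", "Transit",
               "Microlensing", "Radial Velocity"],
    "year": [2010, 2010, 2011, 2012, 2012, 2010],
    "mass": [1.0, np.nan, np.nan, 3.0, np.nan, 10.0],
})

def fill_with_fallback(df, col, levels):
    """Fill NaN in `col` with group means, trying each grouping in `levels` in order.

    Hypothetical helper: group means are always computed from the original
    column, so earlier fills do not distort the later, coarser means.
    """
    filled = df[col]
    for keys in levels:
        filled = filled.fillna(df.groupby(keys)[col].transform("mean"))
    # Last resort: the overall column mean
    return filled.fillna(df[col].mean())

toy["mass_filled"] = fill_with_fallback(
    toy, "mass", [["method", "year"], ["method"]]
)
print(toy)
```

Whether a coarser mean is an acceptable substitute depends on the data; for a column like mass, averaging across detection methods may not be meaningful, so the last fallback step could reasonably be dropped.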
Try this:
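# Fill one column at a time; col is a column name, e.g. from the loop over null_cols
df[col] = df[col].fillna(df.groupby(['method', 'year'])[col].transform('mean'))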
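The same transform-based fill can be applied to several columns in one step, since DataFrame.fillna accepts a DataFrame and aligns it on index and column labels. The sketch below uses a made-up frame (`toy`) with invented values; it shows that groups with no non-NaN value in a column are still left unfilled.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the planets data; the values are made up for illustration
toy = pd.DataFrame({
    "method": ["A", "A", "B", "B"],
    "year": [1, 1, 2, 2],
    "mass": [1.0, np.nan, np.nan, np.nan],
    "distance": [np.nan, 4.0, 5.0, np.nan],
})

cols = ["mass", "distance"]
# Per-group means, broadcast back to the shape of the original rows
group_means = toy.groupby(["method", "year"])[cols].transform("mean")
# fillna aligns on index and columns, so each cell gets its own group mean
toy[cols] = toy[cols].fillna(group_means)
print(toy.isna().sum())
```

Here distance is filled completely, while mass for group ("B", 2) stays NaN because that group has no value to average.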