
Pandas fillna from mean with groupby for multiple columns


I am trying to group by multiple columns and fill the NaN values of multiple columns at the same time. I am attaching a picture of what the data looks like, as well as the code I am having issues with. This is sample data I created to reflect the actual data, which is confidential.

There are 4 columns: name, plant, length and width. There are 3 different types of plant. Each of the last 3 columns has missing data. My end goal is to create a model to guess which plant types are missing. But to do that, I am first attempting to impute the mean length and width for each name/plant combination into their missing values.

The code below calculates the means, and that part works. Where I am failing is inserting them to fill the NaN values.

lengthmean = df.groupby(['name', 'plant']).length.mean()
print(lengthmean)

I get a result that looks like this:

name    plant
Brian   plant 3    2.500000
        plant1     1.850000
        plant2     2.450000
Jeff    plant 3    4.100000
        plant1     2.333333
        plant2     2.100000
Justin  plant 3    2.900000
        plant1     1.900000
        plant2     2.850000
Zach    plant 3    1.750000
        plant1     2.650000
        plant2     3.300000

I am also attempting to do multiple columns at once (both length and width in this case, but in my real data it is more than that). Below is the code that is failing for me.

df[['length','width']] = df.groupby(['name', 'plant'])['length','width']\
    .transform(lambda x: x.fillna(x.mean()))

I am receiving this error: 'ValueError: Length mismatch: Expected axis has 32 elements, new values have 40 elements'

I would appreciate any help, thank you!

[Image: example of data]

Thanks for providing sample data, that really helps!

Looks like the issue is due to your plant column having NaNs. When I run your code df[['length','width']] = df.groupby(['name', 'plant'])['length','width'].transform(lambda x: x.fillna(x.mean())) on the dataset, I do get your error message. By default, groupby leaves out rows whose group key is NaN, which would explain why the transformed result has fewer rows (32) than the DataFrame it is assigned back to (40).
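A quick way to check this on your own data is to compare the number of rows that end up in a group with the total row count. This is only a sketch: the small DataFrame below is made up to stand in for your dataset, and only the column names come from your question.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the real data; row 5 has no plant label.
df = pd.DataFrame({
    "name":   ["Brian", "Brian", "Brian", "Jeff", "Jeff", "Jeff", "Jeff"],
    "plant":  ["plant1", "plant1", "plant1", "plant2", "plant2", np.nan, "plant2"],
    "length": [1.0, np.nan, 3.0, 2.0, 4.0, 5.0, np.nan],
    "width":  [np.nan, 2.0, 4.0, 1.0, np.nan, np.nan, 3.0],
})

# Rows whose grouping key is missing.
n_missing = df["plant"].isna().sum()

# Rows that actually land in a (name, plant) group.
group_sizes = df.groupby(["name", "plant"]).size()

print("rows in df:      ", len(df))
print("rows in groups:  ", group_sizes.sum())
print("rows w/o a plant:", n_missing)
```

If the second number is smaller than the first by exactly the third, the rows with a missing plant are the ones falling out of the groupby.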

When I remove the nulls in the plant column, it works fine:

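# Drop rows with no plant label; .copy() avoids assigning into a view of df
df_cleaned = df.dropna(subset=['plant']).copy()
# Double brackets select several columns (pandas 2.0+ rejects the tuple form ['length','width'])
df_cleaned[['length','width']] = df_cleaned.groupby(['name', 'plant'])[['length','width']]\
    .transform(lambda x: x.fillna(x.mean()))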
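Here is a self-contained sketch of that approach that you can run as-is. The data is made up (only the column names match yours), and df_cleaned and cols are just names chosen for illustration.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the real data; row 5 has no plant label.
df = pd.DataFrame({
    "name":   ["Brian", "Brian", "Brian", "Jeff", "Jeff", "Jeff", "Jeff"],
    "plant":  ["plant1", "plant1", "plant1", "plant2", "plant2", np.nan, "plant2"],
    "length": [1.0, np.nan, 3.0, 2.0, 4.0, 5.0, np.nan],
    "width":  [np.nan, 2.0, 4.0, 1.0, np.nan, np.nan, 3.0],
})

cols = ["length", "width"]

# Keep only rows that have a plant label, so every row belongs to a group.
df_cleaned = df.dropna(subset=["plant"]).copy()

# Within each (name, plant) group, replace NaN with that group's column mean.
df_cleaned[cols] = (
    df_cleaned.groupby(["name", "plant"])[cols]
    .transform(lambda x: x.fillna(x.mean()))
)

print(df_cleaned)
```

Note that the row without a plant label is gone from df_cleaned, so this only suits you if you can afford to lose those rows at this step.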

You'll need to decide what to do with the rows where plant is empty: fill them in, drop them, assign a new plant value, etc.
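Since you eventually want to predict the missing plant values, one option (a sketch, not the only way) is to impute only the rows that have a plant and leave the others in the DataFrame untouched. The data and the names known, filled and cols below are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the real data; row 5 has no plant label.
df = pd.DataFrame({
    "name":   ["Brian", "Brian", "Brian", "Jeff", "Jeff", "Jeff", "Jeff"],
    "plant":  ["plant1", "plant1", "plant1", "plant2", "plant2", np.nan, "plant2"],
    "length": [1.0, np.nan, 3.0, 2.0, 4.0, 5.0, np.nan],
    "width":  [np.nan, 2.0, 4.0, 1.0, np.nan, np.nan, 3.0],
})

cols = ["length", "width"]

# Boolean mask of rows that have a plant label.
known = df["plant"].notna()

# Compute the group-mean fill on the labelled rows only.
filled = (
    df.loc[known]
    .groupby(["name", "plant"])[cols]
    .transform(lambda x: x.fillna(x.mean()))
)

# Write the result back; assignment aligns on the index,
# so rows without a plant label are left exactly as they were.
df.loc[known, cols] = filled

print(df)
```

The rows with a missing plant keep whatever length and width they had (including any NaN), so you would still need to handle those before fitting a model.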

Hope that helps!
