
Pandas fillna from mean with groupby for multiple columns

I am trying to group by multiple columns and fill NaN values in multiple columns at the same time. I am attaching a picture of what the data looks like, as well as the code I am having issues with. This is sample data I created to reflect the actual data, which is confidential.

There are 4 columns: name, plant, length and width. There are 3 different types of plant. There is missing data for each of the last 3. My end goal is to create a model to guess which plant types are missing. But to do that, I am first attempting to impute the mean of the length and width for each name/plant combination into the missing values for them.

The code below calculates the means, and that part works. Where I am failing is inserting them to fill the NaN values.

lengthmean = df.groupby(['name', 'plant']).length.mean()
print(lengthmean)

I get a result that looks like this:

name    plant  
Brian   plant 3    2.500000
        plant1     1.850000
        plant2     2.450000
Jeff    plant 3    4.100000
        plant1     2.333333
        plant2     2.100000
Justin  plant 3    2.900000
        plant1     1.900000
        plant2     2.850000
Zach    plant 3    1.750000
        plant1     2.650000
        plant2     3.300000

I am also attempting to do multiple columns at once (both length and width in this case, but in my real data it is more than that). Below is the code that is failing for me.

df[['length','width']] = df.groupby(['name', 'plant'])['length','width']\
    .transform(lambda x: x.fillna(x.mean()))

I am receiving this error 'ValueError: Length mismatch: Expected axis has 32 elements, new values have 40 elements'

I would appreciate any help, thank you!

[Image: example of data]

Thanks for providing sample data, that really helps!

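Looks like the issue is due to your plant column having NaNs. When I run your code df[['length','width']] = df.groupby(['name', 'plant'])['length','width'].transform(lambda x: x.fillna(x.mean())) on the dataset, I do get your error message. By default, groupby leaves out rows whose grouping key is NaN, so transform returns fewer rows (32) than the DataFrame has (40), and the assignment fails.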
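To see where a mismatch like 32 versus 40 comes from, here is a minimal sketch on made-up data (the frame below is hypothetical, not your dataset). It relies on pandas' default groupby behaviour of leaving out rows whose key is NaN, and compares the number of rows that land in a group with the total row count:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real (confidential) data
df = pd.DataFrame({
    "name":   ["Brian", "Brian", "Jeff", "Jeff", "Jeff"],
    "plant":  ["plant1", np.nan, "plant2", "plant2", np.nan],
    "length": [1.5, 2.0, np.nan, 3.0, 4.0],
})

n_total = len(df)
# size() counts the rows in each group; rows with a NaN key are in no group
n_grouped = int(df.groupby(["name", "plant"]).size().sum())
n_missing_key = int(df["plant"].isna().sum())

print(n_total, n_grouped, n_missing_key)  # 5 3 2
```

The difference between the two counts is exactly the number of rows with a missing plant.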

When I remove nulls in the plant column, it works fine:

# Drop rows with a missing plant; groupby would otherwise skip them
df_cleaned = df.dropna(subset=['plant']).copy()
df_cleaned[['length','width']] = df_cleaned.groupby(['name', 'plant'])[['length','width']]\
    .transform(lambda x: x.fillna(x.mean()))

You'll need to decide what to do with the rows where plant is empty: fill them in, drop them, add a new plant value, etc.
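If you would rather keep the rows with a missing plant (since you want to predict them later), one option is to fill only the rows that do have a plant and leave the rest untouched. A rough sketch on hypothetical data; the frame and the names has_plant, group_means and filled are mine, not from your code:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the original dataset is confidential
df = pd.DataFrame({
    "name":   ["Brian", "Brian", "Brian", "Jeff", "Jeff", "Jeff"],
    "plant":  ["plant1", "plant1", np.nan, "plant2", "plant2", "plant2"],
    "length": [1.0, np.nan, np.nan, 2.0, np.nan, 4.0],
    "width":  [np.nan, 3.0, 7.0, 1.0, 3.0, np.nan],
})
cols = ["length", "width"]

# Rows with a missing plant cannot be assigned to a name/plant group
has_plant = df["plant"].notna()

# Per-group means, broadcast back onto the rows that have a plant
group_means = df[has_plant].groupby(["name", "plant"])[cols].transform("mean")

# Fill NaNs from the matching group mean; alignment is by index and column
filled = df.copy()
filled.loc[has_plant, cols] = df.loc[has_plant, cols].fillna(group_means)

print(filled)
```

Rows without a plant keep whatever length and width they had, including any NaNs, so you can deal with them separately once the plant type has been predicted.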

hope that helps!
