
In Pandas, how can I patch a dataframe with missing values with values from another dataframe given a similar index?

This follows on from the question "Fill in missing row values in pandas dataframe":

I have the following dataframe and would like to fill in missing values.

mukey   hzdept_r    hzdepb_r    sandtotal_r silttotal_r
425897      0         61        
425897      61        152          5.3         44.7
425911      0         30           30.1        54.9
425911      30        74           17.7        49.8
425911      74        84        

I want each missing value to be the average of the values corresponding to that mukey. In this case, e.g., the missing values in the first row will be the averages of sandtotal_r and silttotal_r over the rows where mukey==425897. pandas fillna doesn't seem to do the trick. Any help?


While the code suggested there works for the sample dataframe in that example, it fails on the larger dataset I have uploaded here: https://www.dropbox.com/s/w3m0jppnq74op4c/www004.csv?dl=0

import pandas as pd
df = pd.read_csv('www004.csv')
# CSV file is here: https://www.dropbox.com/s/w3m0jppnq74op4c/www004.csv?dl=0
df1 = df.set_index('mukey')
df1.fillna(df.groupby('mukey').mean(),inplace=True)
df1.reset_index()

I get the error: InvalidIndexError. Why is it not working?

Use combine_first. It lets you patch the missing data in the left dataframe with the matching data from the right dataframe, based on the same index.

In this case, df1 is the dataframe on the left and df2, the means, is the one on the right.

In [48]: df = pd.read_csv('www004.csv')
    ...: df1 = df.set_index('mukey')
    ...: df2 = df.groupby('mukey').mean()

In [49]: df1.loc[426178,:]
Out[49]: 
        hzdept_r  hzdepb_r  sandtotal_r  silttotal_r  claytotal_r   om_r
mukey                                                                   
426178         0        36          NaN          NaN          NaN  72.50
426178        36        66          NaN          NaN          NaN  72.50
426178        66       152         42.1         37.9           20   0.25

In [50]: df2.loc[426178,:]
Out[50]: 
hzdept_r       34.000000
hzdepb_r       84.666667
sandtotal_r    42.100000
silttotal_r    37.900000
claytotal_r    20.000000
om_r           48.416667
Name: 426178, dtype: float64

In [51]: df3 = df1.combine_first(df2)
    ...: df3.loc[426178,:]
Out[51]: 
        hzdept_r  hzdepb_r  sandtotal_r  silttotal_r  claytotal_r   om_r
mukey                                                                   
426178         0        36         42.1         37.9           20  72.50
426178        36        66         42.1         37.9           20  72.50
426178        66       152         42.1         37.9           20   0.25

Note that the following rows still won't have values in the resulting df3

426162
426163
426174
426174
426255

because they were single rows to begin with, so the group mean for those mukey values is itself NaN and .mean() doesn't mean anything to them (eh, see what I did there?).
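If the CSV link is unavailable, the same pattern can be tried on the five sample rows from the question. This is only a sketch: the frame is typed in by hand from the table above, and the variable names follow the session above.

```python
import numpy as np
import pandas as pd

# Sample rows from the question; NaN marks the missing values.
df = pd.DataFrame({
    "mukey": [425897, 425897, 425911, 425911, 425911],
    "hzdept_r": [0, 61, 0, 30, 74],
    "hzdepb_r": [61, 152, 30, 74, 84],
    "sandtotal_r": [np.nan, 5.3, 30.1, 17.7, np.nan],
    "silttotal_r": [np.nan, 44.7, 54.9, 49.8, np.nan],
})

df1 = df.set_index("mukey")        # left frame, index contains duplicates
df2 = df.groupby("mukey").mean()   # right frame, one row of means per mukey

# Only the NaN cells of df1 are taken from df2; existing values are kept.
df3 = df1.combine_first(df2)
print(df3)
```

Each missing sandtotal_r / silttotal_r cell should come back holding the mean for its mukey, while the depth columns keep their original values.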

The problem is the duplicate index values. When you use df1.fillna(df2), if you have multiple NaN entries in df1 where both the index and the column label are the same, pandas will get confused when trying to slice df1, and throw that InvalidIndexError.

Your sample dataframe works because even though you have duplicate index values there, only one of each index value is null. Your larger dataframe contains null entries that share both the index value and column label in some cases.

To make this work, you can fill one column at a time. When operating on a Series, pandas does not get confused by multiple entries with the same index label: it looks each label up in the Series you fill from and writes the same value into every matching row. Hence, this should work:

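import pandas as pd
df = pd.read_csv('www004.csv')
# CSV file is here: https://www.dropbox.com/s/w3m0jppnq74op4c/www004.csv?dl=0
df1 = df.set_index('mukey')
# One row of column means per mukey; this index is unique
grouped = df.groupby('mukey').mean()
# Fill each column separately so the lookup is Series-to-Series
for col in ['sandtotal_r', 'silttotal_r']:
    df1[col] = df1[col].fillna(grouped[col])
# reset_index returns a new dataframe, so keep the result
df1 = df1.reset_index()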
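For a version that runs without the CSV, here is a rough sketch on made-up rows shaped like the question's data, where one mukey has two NaN cells in the same column. The values and the names cols, filled and alt are assumptions for illustration. The last part is a cross-check using groupby(...).transform("mean"), which is a different idiom and not part of the answer above.

```python
import numpy as np
import pandas as pd

# Made-up rows shaped like the question's data. mukey 425911 has two NaN
# cells in the same column, the case described above as problematic.
df = pd.DataFrame({
    "mukey": [425897, 425897, 425911, 425911, 425911],
    "sandtotal_r": [np.nan, 5.3, 30.0, np.nan, np.nan],
    "silttotal_r": [np.nan, 44.7, 50.0, np.nan, np.nan],
})
cols = ["sandtotal_r", "silttotal_r"]

df1 = df.set_index("mukey")
grouped = df.groupby("mukey").mean()   # unique index: one mean per mukey

# Series.fillna looks up each missing row's label in `grouped`, so every
# duplicate label receives the same group mean.
for col in cols:
    df1[col] = df1[col].fillna(grouped[col])
filled = df1.reset_index()

# Cross-check with a different idiom (not from the answer): transform("mean")
# returns the group mean aligned to the original rows, so no index is needed.
alt = df.copy()
alt[cols] = alt[cols].fillna(df.groupby("mukey")[cols].transform("mean"))

print(filled)
print(filled.equals(alt))
```

The transform variant never puts mukey in the index, so the duplicate-label question does not come up at all.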

NOTE: Be careful using the combine_first method if you ever have "extra" data in the dataframe you're filling from. The combine_first function will include ALL indices from the dataframe you're filling from, even if they're not present in the original dataframe.
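A small illustration of that caveat, using two hypothetical frames (original, fills) with unique indices. The filter in the second half is just one possible way to keep the extra labels out, not something the answers above prescribe.

```python
import numpy as np
import pandas as pd

# Hypothetical frames: `fills` has a label (3) that `original` lacks.
original = pd.DataFrame({"value": [1.0, np.nan]}, index=[1, 2])
fills = pd.DataFrame({"value": [10.0, 20.0, 30.0]}, index=[1, 2, 3])

# Label 3 shows up in the result even though `original` never had it.
patched = original.combine_first(fills)
print(patched)

# One way to avoid that: restrict the fill frame to labels already present.
safe = original.combine_first(fills.loc[fills.index.isin(original.index)])
print(safe)
```

When the fill frame is built with groupby on the same dataframe, as in the answers above, it has no extra labels, so this only matters when the fill values come from somewhere else.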
