
Pandas fillna with DataFrame of values

According to the docs, the fillna value parameter can be one of the following:

value : scalar, dict, Series, or DataFrame
Value to use to fill holes (e.g. 0), alternately a dict/Series/DataFrame of values specifying which value to use for each index (for a Series) or column (for a DataFrame). (values not in the dict/Series/DataFrame will not be filled). This value cannot be a list.

I have a data frame that looks like:

   PassengerId  Pclass  Name                                          Sex     Age   SibSp  Parch  Ticket   Fare     Cabin  Embarked
0  892          3       Kelly, Mr. James                              male    34.5  0      0      330911   7.8292   NaN    Q
1  893          3       Wilkes, Mrs. James (Ellen Needs)              female  47.0  1      0      363272   7.0000   NaN    S
2  894          2       Myles, Mr. Thomas Francis                     male    62.0  0      0      240276   9.6875   NaN    Q
3  895          3       Wirz, Mr. Albert                              male    27.0  0      0      315154   8.6625   NaN    S
4  896          3       Hirvonen, Mrs. Alexander (Helga E Lindqvist)  female  22.0  1      1      3101298  12.2875  NaN    S

This is what I want to do:

  1. NaN Cabin will be filled with the most common value (mode) for the row's Pclass
  2. NaN Age will be filled with the median Age across the data set
  3. NaN Embarked will be filled with the most common value (mode) for the row's Pclass

So after some data manipulation, I got this data frame:

   Pclass  Cabin  Embarked  Ticket
0  1       C      S         50
1  2       F      S         13
2  3       G      S         5

It says that for Pclass == 1 the most common Cabin is C. Given that, in my original data frame df I want to fill every null df['Cabin'] with C wherever Pclass == 1.

This is a small example, and I could handle each possible null combination by hand with something like:

# Parenthesize each condition; NaN never equals itself, so test with isna()
df_both.loc[(df_both['Pclass'] == 1) & df_both['Cabin'].isna(), 'Cabin'] = 'C'

However, I wonder whether I can use this derived data frame to do all the filling automatically.

Thank you.

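If you want to fill all NaN values with something like the median or the mean of each column, you can do the following. Passing numeric_only=True restricts the statistic to numeric columns; recent pandas versions raise an error when the frame also contains string columns such as Name or Cabin.

For the median:

df.fillna(df.median(numeric_only=True))

For the mean:

df.fillna(df.mean(numeric_only=True))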

see https://pandas.pydata.org/pandas-docs/stable/missing_data.html#filling-with-a-pandasobject for more information.
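As a minimal sketch of the median fill above, the following uses a made-up two-column frame (the values are illustrative, not taken from the question's data). It shows that only the numeric column is filled, while the string column keeps its NaN values.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values) with one numeric and one string column.
df = pd.DataFrame({
    "Age": [34.5, np.nan, 62.0, 27.0],
    "Cabin": ["C", np.nan, "F", np.nan],
})

# numeric_only=True skips the string column when computing medians.
medians = df.median(numeric_only=True)

# fillna matches the Series index ("Age") against the column names,
# so columns without an entry in `medians` are left untouched.
filled = df.fillna(medians)
print(filled)
```

The Cabin column is unchanged here because the medians Series has no entry for it.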

Edit:

Alternatively, you can pass a dictionary of fill values whose keys are column names. This way you can also impute missing values in string columns.

df.fillna({'col1':'a','col2': 1})
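The dictionary form fills a whole column with a single value. For the per-Pclass fill described in the question, one possible approach is to map each row's Pclass through the derived table and pass the resulting Series to fillna. This is a sketch under assumptions: the sample rows are made up, and the name lookup is a hypothetical stand-in for the derived data frame from the question.

```python
import numpy as np
import pandas as pd

# Toy data (made-up values) mimicking the columns in the question.
df = pd.DataFrame({
    "Pclass": [3, 3, 2, 1, 1],
    "Age": [34.5, np.nan, 62.0, 27.0, np.nan],
    "Cabin": [np.nan, "G", np.nan, np.nan, "C"],
    "Embarked": ["Q", np.nan, "S", np.nan, "S"],
})

# Hypothetical lookup table shaped like the derived frame in the question:
# the most common Cabin and Embarked per Pclass, indexed by Pclass.
lookup = pd.DataFrame({
    "Pclass": [1, 2, 3],
    "Cabin": ["C", "F", "G"],
    "Embarked": ["S", "S", "S"],
}).set_index("Pclass")

# One value for the whole column: the dict form of fillna.
df = df.fillna({"Age": df["Age"].median()})

# Per-class values: map() converts each row's Pclass into the lookup value,
# giving a Series aligned with df's index; fillna only touches the NaN rows.
for col in ["Cabin", "Embarked"]:
    df[col] = df[col].fillna(df["Pclass"].map(lookup[col]))

print(df)
```

Existing non-null values are preserved, since fillna only replaces the missing entries.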
