
Fill NaN in Pandas DataFrame by selecting value from other DataFrame

I am toying with the Titanic dataset and am trying to fill in the missing Age values. My dataframes look like:

DataFrame df

    Survived  Pclass   Age  SibSp  Parch      Fare  male  Q  S   Title
0           0       3  22.0      1      0    7.2500     1  0  1      Mr
1           1       1  38.0      1      0   71.2833     0  0  0     Mrs
2           1       3  26.0      0      0    7.9250     0  0  1    Miss
3           1       1  35.0      1      0   53.1000     0  0  1     Mrs
4           0       3  35.0      0      0    8.0500     1  0  1      Mr
5           0       3   NaN      0      0    8.4583     1  1  0      Mr

And

DataFrame age_df
                    3        1        2
    Mr        28.7249  41.5805  32.7683
    Mrs       33.5152  40.8824  33.6829
    Miss      16.1232       30  22.3906
    Master    5.35083  5.30667  2.25889
    Don            40       40       40
    Rev       43.1667  43.1667  43.1667
    Dr             42    43.75     38.5
    Mme            24       24       24
    Ms             28       28       28
    Major        48.5     48.5     48.5
    Lady           48       48       48
    Sir            49       49       49
    Mlle           24       24       24
    Col            58       58       58
    Capt           70       70       70
    Countess       33       33       33
    Jonkheer       38       38       38

I want to fill the missing values in df['Age'] with the corresponding value from age_df, looked up by df['Title'] (row label) and df['Pclass'] (column label).

I've come up with this, but none of the NaNs get overwritten.

for tit in df['Title'].unique():
    for cls in [1,2,3]:
        df.loc[ (df['Age'].isna() == True) &
                (df['Title'] == tit) &
                (df['Pclass'] == cls)]['Age'] = age_df.loc[tit][cls]

Furthermore, I don't think this should be done with a nested loop. How should I be doing this?

One way is to use apply with an if/else condition, as below:

df['Age'] = df.apply(lambda row: age_df.loc[row.Title, row.Pclass] 
                                               if pd.isnull(row.Age) 
                                               else row.Age, axis=1)

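You can use lookup (note that DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0, so this works only on older versions):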

In [75]: s = pd.Series(age_df.lookup(df.Title, df.Pclass), index=df.index)    
In [76]: s
Out[76]: 
0    28.7249
1    40.8824
2    16.1232
3    40.8824
4    28.7249
5    28.7249
dtype: float64

In [77]: df.Age = df.Age.fillna(s)   
In [78]: df.Age
Out[78]: 
0    22.0000
1    38.0000
2    26.0000
3    35.0000
4    35.0000
5    28.7249
Name: Age, dtype: float64
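On pandas 2.0 and later, where DataFrame.lookup is no longer available, the same label-based lookup can be approximated by translating the labels into integer positions and indexing the underlying NumPy array. The following is a minimal sketch, not the answer's original method; the small frames are stand-ins built from the values shown in the question, and the guard against unknown labels is an added assumption.

```python
import numpy as np
import pandas as pd

# Small stand-ins for the question's frames (values copied from the post).
df = pd.DataFrame({
    "Pclass": [3, 1, 3, 1, 3, 3],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0, np.nan],
    "Title": ["Mr", "Mrs", "Miss", "Mrs", "Mr", "Mr"],
})
age_df = pd.DataFrame(
    {3: [28.7249, 33.5152, 16.1232],
     1: [41.5805, 40.8824, 30.0],
     2: [32.7683, 33.6829, 22.3906]},
    index=["Mr", "Mrs", "Miss"],
)

# Translate row/column labels into integer positions.
rows = age_df.index.get_indexer(df["Title"])
cols = age_df.columns.get_indexer(df["Pclass"])

# get_indexer returns -1 for labels it cannot find; fail loudly instead of
# silently picking the last row or column.
if (rows < 0).any() or (cols < 0).any():
    raise KeyError("some Title/Pclass pairs are missing from age_df")

# One looked-up age per row of df, aligned on df's index.
s = pd.Series(age_df.to_numpy()[rows, cols], index=df.index)

# Only the missing ages are replaced; existing ages are kept.
df["Age"] = df["Age"].fillna(s)
```

This keeps the fill fully vectorized, with no Python-level loop over titles or classes.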

Solved by using loc[,] instead of loc[][]:

for tit in df['Title'].unique():
    for cls in [1,2,3]:
        df.loc[ (df['Age'].isna() == True) &
                (df['Title'] == tit) &
                (df['Pclass'] == cls), 'Age'] = age_df.loc[tit,cls]
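The reason the first attempt had no effect is that df.loc[mask] with a boolean mask returns a new DataFrame, so the chained ['Age'] = ... assignment modifies that temporary copy rather than df. The sketch below illustrates the difference on a two-row stand-in frame (the data and the variable names still_missing and missing_after are made up for illustration):

```python
import warnings

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Age": [22.0, np.nan],
    "Title": ["Mr", "Mr"],
    "Pclass": [3, 3],
})
mask = df["Age"].isna() & (df["Title"] == "Mr") & (df["Pclass"] == 3)

# Chained indexing: the assignment lands on a temporary copy of the rows.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # pandas warns about this pattern
    df.loc[mask]["Age"] = 28.7249
still_missing = int(df["Age"].isna().sum())

# Single .loc call with row mask and column label: writes into df itself.
df.loc[mask, "Age"] = 28.7249
missing_after = int(df["Age"].isna().sum())

print(still_missing, missing_after)
```

The first count shows the NaN surviving the chained assignment; the second shows it filled once the row mask and column label are passed to a single loc call.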

I'm still curious about how it should be done without a loop.

You can get rid of one loop by looping only over the few Pclass values (the columns of age_df) and using the titles to map the values.

for col in age_df:
    mask = (df.Age.isnull()) & (df.Pclass==int(col))
    df.loc[mask, 'Age'] = df.loc[mask, 'Title'].map(age_df[col])

   Survived  Pclass      Age  SibSp  Parch     Fare  male  Q  S Title
0         0       3  22.0000      1      0   7.2500     1  0  1    Mr
1         1       1  38.0000      1      0  71.2833     0  0  0   Mrs
2         1       3  26.0000      0      0   7.9250     0  0  1  Miss
3         1       1  35.0000      1      0  53.1000     0  0  1   Mrs
4         0       3  35.0000      0      0   8.0500     1  0  1    Mr
5         0       3  28.7249      0      0   8.4583     1  1  0    Mr

You can use melt to reshape age_df into tidy (long) format, then merge and fill.

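# Assumes the titles are the index of age_df; move them into a 'Title' column first
age_df = age_df.rename_axis('Title').reset_index().melt('Title', var_name='Pclass')
age_df[:3]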
    Title   Pclass  value
0   Mr      3       28.7249
1   Mrs     3       33.5152
2   Miss    3       16.1232

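# Left merge keeps every passenger and adds the looked-up age as 'value'
df = df.merge(age_df, on=['Title', 'Pclass'], how='left')
idx = df.Age.isnull()
# Assign through a single .loc call; df.Age[idx] = ... is chained assignment
df.loc[idx, 'Age'] = df.loc[idx, 'value']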
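For reference, here is a self-contained sketch of the melt-and-merge approach that can be run as-is. The sample frames are stand-ins built from the values in the question; the names long_ages, merged and FillAge are made up for this illustration, and casting Pclass to the dtype used in df is an added precaution so the merge keys have matching types.

```python
import numpy as np
import pandas as pd

# Stand-in data using values from the question.
df = pd.DataFrame({
    "Pclass": [3, 1, 3, 1, 3, 3],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0, np.nan],
    "Title": ["Mr", "Mrs", "Miss", "Mrs", "Mr", "Mr"],
})
age_df = pd.DataFrame(
    {3: [28.7249, 33.5152, 16.1232],
     1: [41.5805, 40.8824, 30.0],
     2: [32.7683, 33.6829, 22.3906]},
    index=["Mr", "Mrs", "Miss"],
)

# Wide -> long: one row per (Title, Pclass) pair.
long_ages = (
    age_df.rename_axis("Title")
          .reset_index()
          .melt(id_vars="Title", var_name="Pclass", value_name="FillAge")
)
# Make sure the merge key has the same dtype on both sides.
long_ages["Pclass"] = long_ages["Pclass"].astype(df["Pclass"].dtype)

# Left merge keeps every passenger, in the original row order.
merged = df.merge(long_ages, on=["Title", "Pclass"], how="left")
merged["Age"] = merged["Age"].fillna(merged["FillAge"])
merged = merged.drop(columns="FillAge")
```

Note that merge returns a frame with a fresh default index, so if df carries a meaningful index it would need to be preserved separately (for example by resetting it into a column before merging).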

This is not the shortest approach, but after reading all the other answers, I still like mine.
