Fill NaN in a Pandas DataFrame by selecting values from another DataFrame

Question

I'm playing with the Titanic dataset and trying to fill in the missing Age values. My DataFrame looks like this:

DataFrame df

    Survived  Pclass   Age  SibSp  Parch      Fare  male  Q  S   Title
0           0       3  22.0      1      0    7.2500     1  0  1      Mr
1           1       1  38.0      1      0   71.2833     0  0  0     Mrs
2           1       3  26.0      0      0    7.9250     0  0  1    Miss
3           1       1  35.0      1      0   53.1000     0  0  1     Mrs
4           0       3  35.0      0      0    8.0500     1  0  1      Mr
5           0       3   NaN      0      0    8.4583     1  1  0      Mr

and

DataFrame age_df
                    3        1        2
    Mr        28.7249  41.5805  32.7683
    Mrs       33.5152  40.8824  33.6829
    Miss      16.1232       30  22.3906
    Master    5.35083  5.30667  2.25889
    Don            40       40       40
    Rev       43.1667  43.1667  43.1667
    Dr             42    43.75     38.5
    Mme            24       24       24
    Ms             28       28       28
    Major        48.5     48.5     48.5
    Lady           48       48       48
    Sir            49       49       49
    Mlle           24       24       24
    Col            58       58       58
    Capt           70       70       70
    Countess       33       33       33
    Jonkheer       38       38       38

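I want to fill the missing values in df['Age'] with the corresponding values from age_df, looked up by df['Title'] and df['Pclass'].

I came up with this, but none of the NaNs get overwritten.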

for tit in df['Title'].unique():
    for cls in [1,2,3]:
        df.loc[ (df['Age'].isna() == True) &
                (df['Title'] == tit) &
                (df['Pclass'] == cls)]['Age'] = age_df.loc[tit][cls]

Also, I don't think this should need nested loops. How should I do it?

Answer 1

One approach is to use apply with an if/else condition, as follows:

df['Age'] = df.apply(lambda row: age_df.loc[row.Title, row.Pclass] 
                                               if pd.isnull(row.Age) 
                                               else row.Age, axis=1)

Answer 2

You can use lookup (available in pandas versions before 2.0):

In [75]: s = pd.Series(age_df.lookup(df.Title, df.Pclass), index=df.index)    
In [76]: s
Out[76]: 
0    28.7249
1    40.8824
2    16.1232
3    40.8824
4    28.7249
5    28.7249
dtype: float64

In [77]: df.Age = df.Age.fillna(s)   
In [78]: df.Age
Out[78]: 
0    22.0000
1    38.0000
2    26.0000
3    35.0000
4    35.0000
5    28.7249
Name: Age, dtype: float64
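
DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0, so the session above only runs on older versions. The following is a minimal sketch of the same idea using NumPy integer indexing. It assumes age_df has the titles as its index and integer class labels as its columns; the two small frames are stand-ins built from the question's sample values, not the full dataset.

```python
import numpy as np
import pandas as pd

# Small stand-ins for the question's frames (values copied from the sample).
df = pd.DataFrame({
    "Pclass": [3, 1, 3, 1, 3, 3],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0, np.nan],
    "Title": ["Mr", "Mrs", "Miss", "Mrs", "Mr", "Mr"],
})
age_df = pd.DataFrame(
    {
        3: [28.7249, 33.5152, 16.1232],
        1: [41.5805, 40.8824, 30.0],
        2: [32.7683, 33.6829, 22.3906],
    },
    index=["Mr", "Mrs", "Miss"],
)

# Translate the labels in df into integer positions within age_df.
rows = age_df.index.get_indexer(df["Title"])
cols = age_df.columns.get_indexer(df["Pclass"])

# Pick one value per row of df, which is what lookup used to do.
s = pd.Series(age_df.to_numpy()[rows, cols], index=df.index)

# Only the missing ages are replaced; existing ages are kept.
df["Age"] = df["Age"].fillna(s)
print(df["Age"].tolist())
```

Note that get_indexer returns -1 for a label it cannot find, which NumPy would treat as "last row" or "last column", so it is worth checking that every Title and Pclass in df actually appears in age_df before relying on this.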

Answer 3

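Use loc[rows, column] instead of loc[rows][column]. The chained form assigns to a temporary copy, so df itself is never modified.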

for tit in df['Title'].unique():
    for cls in [1,2,3]:
        df.loc[ (df['Age'].isna() == True) &
                (df['Title'] == tit) &
                (df['Pclass'] == cls), 'Age'] = age_df.loc[tit,cls]

I'm still curious how to do this without a loop.
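
The difference between the two indexing forms can be checked on a tiny frame. This is only an illustrative sketch: the two-row df and the names chained_result and loc_result are made up for the demonstration, and the warning filter is there because pandas may warn about the chained pattern.

```python
import warnings

import numpy as np
import pandas as pd

df = pd.DataFrame({"Age": [22.0, np.nan], "Title": ["Mr", "Mr"], "Pclass": [3, 3]})
mask = df["Age"].isna() & (df["Title"] == "Mr") & (df["Pclass"] == 3)

# Chained form: df.loc[mask] builds a new DataFrame, and the assignment
# lands on that temporary object, so df keeps its missing value.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # pandas may warn about this pattern
    df.loc[mask]["Age"] = 28.7249
chained_result = int(df["Age"].isna().sum())

# Single .loc call with (row selector, column label): writes into df.
df.loc[mask, "Age"] = 28.7249
loc_result = int(df["Age"].isna().sum())

print(chained_result, loc_result)
```

The first count stays at one missing value and the second drops to zero, which matches the behaviour described in the question.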

Answer 4

You can get rid of one loop by iterating over the smaller number of Pclass values and then using Title to map the values.

for col in age_df:
    mask = (df.Age.isnull()) & (df.Pclass==int(col))
    df.loc[mask, 'Age'] = df.loc[mask, 'Title'].map(age_df[col])

   Survived  Pclass      Age  SibSp  Parch     Fare  male  Q  S Title
0         0       3  22.0000      1      0   7.2500     1  0  1    Mr
1         1       1  38.0000      1      0  71.2833     0  0  0   Mrs
2         1       3  26.0000      0      0   7.9250     0  0  1  Miss
3         1       1  35.0000      1      0  53.1000     0  0  1   Mrs
4         0       3  35.0000      0      0   8.0500     1  0  1    Mr
5         0       3  28.7249      0      0   8.4583     1  1  0    Mr

Answer 5

You can use melt to reshape your age_df into tidy format, then merge and fill.

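# Assumes Title is a regular column of age_df; if it is the index,
# move it into a column first (e.g. with reset_index).
age_df = age_df.melt('Title', var_name='Pclass')
age_df[:3]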
    Title   Pclass  value
0   Mr      3       28.7249
1   Mrs     3       33.5152
2   Miss    3       16.1232

df = df.merge(age_df, how='left')
# Fill only the missing ages; fillna avoids the chained assignment df.Age[idx] = ...
df['Age'] = df['Age'].fillna(df['value'])

This isn't the shortest method, but even after reading all the other answers, I still like mine.
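
For reference, the melt-and-merge idea can be sketched end to end on a few rows. The frames below are small stand-ins built from the question's sample values, and the names long_ages and FillAge are hypothetical helpers introduced for this sketch. It assumes the titles are stored in an index named Title and that the class labels are integers.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Pclass": [3, 1, 3],
    "Age": [22.0, 38.0, np.nan],
    "Title": ["Mr", "Mrs", "Mr"],
})
age_df = pd.DataFrame(
    {3: [28.7249, 33.5152], 1: [41.5805, 40.8824], 2: [32.7683, 33.6829]},
    index=pd.Index(["Mr", "Mrs"], name="Title"),
)

# Turn the Title index into a column, then melt to one row per (Title, Pclass).
long_ages = age_df.reset_index().melt("Title", var_name="Pclass", value_name="FillAge")
long_ages["Pclass"] = long_ages["Pclass"].astype(int)  # match df's key dtype

# A left merge keeps every row of df, in its original order.
merged = df.merge(long_ages, on=["Title", "Pclass"], how="left")

# Fill only the missing ages, then drop the helper column.
merged["Age"] = merged["Age"].fillna(merged["FillAge"])
merged = merged.drop(columns="FillAge")
print(merged)
```

Naming the value column explicitly and merging on both keys makes it harder for an unrelated shared column to slip into the join by accident.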

5 answers

Answer 1: score 1, accepted, 2018-05-24 19:25:29
Answer 2: score 1, 2018-05-24 19:28:49
Answer 3: score 0, 2018-05-24 19:19:39
Answer 4: score 0, 2018-05-24 19:19:47
Answer 5: score 0, 2018-05-24 19:44:20
