Fill NaN in Pandas DataFrame by selecting value from other DataFrame
I am toying with the Titanic dataset and am trying to fill in the missing Age values. My dataframes look like:
DataFrame df
Survived Pclass Age SibSp Parch Fare male Q S Title
0 0 3 22.0 1 0 7.2500 1 0 1 Mr
1 1 1 38.0 1 0 71.2833 0 0 0 Mrs
2 1 3 26.0 0 0 7.9250 0 0 1 Miss
3 1 1 35.0 1 0 53.1000 0 0 1 Mrs
4 0 3 35.0 0 0 8.0500 1 0 1 Mr
5 0 3 NaN 0 0 8.4583 1 1 0 Mr
And
DataFrame age_df
3 1 2
Mr 28.7249 41.5805 32.7683
Mrs 33.5152 40.8824 33.6829
Miss 16.1232 30 22.3906
Master 5.35083 5.30667 2.25889
Don 40 40 40
Rev 43.1667 43.1667 43.1667
Dr 42 43.75 38.5
Mme 24 24 24
Ms 28 28 28
Major 48.5 48.5 48.5
Lady 48 48 48
Sir 49 49 49
Mlle 24 24 24
Col 58 58 58
Capt 70 70 70
Countess 33 33 33
Jonkheer 38 38 38
I want to fill the missing values in df['Age'] with the corresponding value from age_df, based on df['Title'] and df['Pclass'].
I've come up with this, but none of the NaNs get overwritten.
for tit in df['Title'].unique():
    for cls in [1,2,3]:
        df.loc[ (df['Age'].isna() == True) &
                (df['Title'] == tit) &
                (df['Pclass'] == cls)]['Age'] = age_df.loc[tit][cls]
Furthermore, I don't think this should be done with a nested loop. How should I be doing this?
One way may be to use apply with an if/else condition, as below:
df['Age'] = df.apply(lambda row: age_df.loc[row.Title, row.Pclass]
if pd.isnull(row.Age)
else row.Age, axis=1)
You can use lookup:
In [75]: s = pd.Series(age_df.lookup(df.Title, df.Pclass), index=df.index)
In [76]: s
Out[76]:
0 28.7249
1 40.8824
2 16.1232
3 40.8824
4 28.7249
5 28.7249
dtype: float64
In [77]: df.Age = df.Age.fillna(s)
In [78]: df.Age
Out[78]:
0 22.0000
1 38.0000
2 26.0000
3 35.0000
4 35.0000
5 28.7249
Name: Age, dtype: float64
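Note that DataFrame.lookup was deprecated in pandas 1.2 and removed in pandas 2.0, so the session above only runs on older versions. A minimal sketch of the same label-based lookup using positional indexing is shown here. It assumes age_df has the titles as its index and integer Pclass values as column labels, and it uses a reduced sample of the data from the question.

```python
import numpy as np
import pandas as pd

# Reduced sample of the question's data (assumed layout: titles as the
# index of age_df, integer Pclass values as its column labels).
df = pd.DataFrame({
    "Pclass": [3, 1, 3, 1, 3, 3],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0, np.nan],
    "Title": ["Mr", "Mrs", "Miss", "Mrs", "Mr", "Mr"],
})
age_df = pd.DataFrame(
    {3: [28.7249, 33.5152, 16.1232],
     1: [41.5805, 40.8824, 30.0],
     2: [32.7683, 33.6829, 22.3906]},
    index=["Mr", "Mrs", "Miss"],
)

# Translate row and column labels into integer positions.
rows = age_df.index.get_indexer(df["Title"])
cols = age_df.columns.get_indexer(df["Pclass"])

# get_indexer returns -1 for labels it cannot find; fail loudly instead of
# silently picking the last row or column.
if (rows < 0).any() or (cols < 0).any():
    raise KeyError("some Title/Pclass combination is missing from age_df")

# One fancy-indexing call picks the (row, col) cell for every passenger.
s = pd.Series(age_df.to_numpy()[rows, cols], index=df.index)

# Only the missing ages are replaced; existing ages are kept.
df["Age"] = df["Age"].fillna(s)
```

If age_df was read from a file and its column labels are strings such as "1", the Pclass values would need to be converted to the same type before calling get_indexer.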
Solved by using loc[,] instead of loc[][]:
for tit in df['Title'].unique():
    for cls in [1,2,3]:
        df.loc[ (df['Age'].isna() == True) &
                (df['Title'] == tit) &
                (df['Pclass'] == cls), 'Age'] = age_df.loc[tit,cls]
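The difference between the two forms can be seen in a small sketch, using a reduced three-row sample rather than the full dataset. With loc[mask]['Age'], the first indexing step produces a new object and the assignment lands on that temporary object; with loc[mask, 'Age'], a single call writes into df itself.

```python
import warnings

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Pclass": [3, 1, 3],
    "Age": [22.0, 38.0, np.nan],
    "Title": ["Mr", "Mrs", "Mr"],
})
mask = df["Age"].isna() & (df["Title"] == "Mr") & (df["Pclass"] == 3)

# Chained indexing: df.loc[mask] builds a separate object, so the
# assignment modifies that temporary object and df stays unchanged.
# pandas normally warns about this; the warning is silenced here only
# to keep the demonstration quiet.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    df.loc[mask]["Age"] = 28.7249
missing_after_chained = int(df["Age"].isna().sum())

# Single .loc call with a row mask and a column label: writes into df.
df.loc[mask, "Age"] = 28.7249
missing_after_single = int(df["Age"].isna().sum())

print(missing_after_chained, missing_after_single)
```

The same reasoning applies to the right-hand side only as a matter of style: age_df.loc[tit][cls] reads correctly, but age_df.loc[tit, cls] does the lookup in one step.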
I'm still curious about how it should be done without a loop.
You can get rid of one loop by looping only over the smaller number of Pclass values, and then using the titles to map the values.
for col in age_df:
    mask = (df.Age.isnull()) & (df.Pclass==int(col))
    df.loc[mask, 'Age'] = df.loc[mask, 'Title'].map(age_df[col])
Survived Pclass Age SibSp Parch Fare male Q S Title
0 0 3 22.0000 1 0 7.2500 1 0 1 Mr
1 1 1 38.0000 1 0 71.2833 0 0 0 Mrs
2 1 3 26.0000 0 0 7.9250 0 0 1 Miss
3 1 1 35.0000 1 0 53.1000 0 0 1 Mrs
4 0 3 35.0000 0 0 8.0500 1 0 1 Mr
5 0 3 28.7249 0 0 8.4583 1 1 0 Mr
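The remaining loop over Pclass can probably be dropped as well. One possible sketch, not taken from the answers here, stacks age_df into a Series keyed by (Title, Pclass) and reindexes it with the key pairs from df. It assumes titles are the index of age_df, the column labels are integers, and every (Title, Pclass) pair appears only once in age_df; the data is a reduced sample.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Pclass": [3, 1, 3, 2],
    "Age": [22.0, np.nan, np.nan, 30.0],
    "Title": ["Mr", "Mrs", "Mr", "Miss"],
})
age_df = pd.DataFrame(
    {3: [28.7249, 33.5152, 16.1232],
     1: [41.5805, 40.8824, 30.0],
     2: [32.7683, 33.6829, 22.3906]},
    index=["Mr", "Mrs", "Miss"],
)

# Series with a two-level index: (Title, Pclass) -> mean age.
lookup = age_df.stack()

# One (Title, Pclass) key per passenger, in the row order of df.
keys = pd.MultiIndex.from_frame(df[["Title", "Pclass"]])

# reindex pulls the matching value for each key (NaN if a pair is absent);
# to_numpy drops the MultiIndex so the values align with df by position.
fill = pd.Series(lookup.reindex(keys).to_numpy(), index=df.index)

df["Age"] = df["Age"].fillna(fill)
```

Known ages are left untouched because fillna only consults fill where Age is missing.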
You can use melt to reshape your age_df into tidy format, then merge and fill.
age_df = age_df.melt('Title', var_name='Pclass')
age_df[:4]
Title Pclass value
0 Mr 3 28.7249
1 Mrs 3 33.5152
2 Miss 3 16.1232
df = df.merge(age_df, how='left')
idx = df.Age.isnull()
# assign through a single .loc call to avoid chained assignment
df.loc[idx, 'Age'] = df.loc[idx, 'value']
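For reference, here is a self-contained sketch of the melt and merge approach on a reduced sample. It assumes age_df starts with the titles as its index, so the index is first turned into a Title column. The names tidy, merged and AgeFill are illustrative and not part of the original answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Pclass": [3, 1, 3],
    "Age": [22.0, 38.0, np.nan],
    "Title": ["Mr", "Mrs", "Mr"],
})
age_df = pd.DataFrame(
    {3: [28.7249, 33.5152],
     1: [41.5805, 40.8824],
     2: [32.7683, 33.6829]},
    index=["Mr", "Mrs"],
)

# Wide -> long: one row per (Title, Pclass) pair.
tidy = (
    age_df.rename_axis("Title")
    .reset_index()
    .melt(id_vars="Title", var_name="Pclass", value_name="AgeFill")
)
# The melted column labels may not come back as plain integers, so make
# the key dtype match df["Pclass"] explicitly before merging.
tidy["Pclass"] = tidy["Pclass"].astype(int)

# A left merge keeps every passenger, in the original order.
merged = df.merge(tidy, on=["Title", "Pclass"], how="left")

# Fill only the missing ages, then drop the helper column.
merged["Age"] = merged["Age"].fillna(merged["AgeFill"])
merged = merged.drop(columns="AgeFill")
```

Naming the merge keys with on= avoids relying on whichever columns the two frames happen to share.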
This is not the shortest approach, but after reading all the other answers, I still love mine.