How to complete NaN cells based on another Pandas dataframe in Python
I have the following two dataframes.
First dataframe, df1:
import pandas as pd
import numpy as np
d1 = {'id': [1, 2, 3, 4], 'col1': [13, np.nan, 15, np.nan], 'col2': [23, np.nan, np.nan, np.nan]}
df1 = pd.DataFrame(data=d1)
df1
   id  col1  col2
0   1  13.0  23.0
1   2   NaN   NaN
2   3  15.0   NaN
3   4   NaN   NaN
And the second dataframe, df2:
d2 = {'id': [2, 3, 4], 'col1': [ 14, 150, 16], 'col2': [24, 250, np.nan]}
df2 = pd.DataFrame(data=d2)
df2
   id  col1   col2
0   2    14   24.0
1   3   150  250.0
2   4    16    NaN
I need to replace the NaN fields in df1 with the non-NaN values from df2, where possible. But there are some conditions:
Condition 1) The id column in each dataframe consists of unique values. When replacing any NaN value in df1 with a value from df2, the id column values need to match.
Condition 2) The dataframes do not necessarily have the same size.
Condition 3) NaN values will only be looked for in col1 or col2 in either dataframe. The id column cannot be NaN in any row. There might be other columns in the dataframes, with or without NaN values, but for replacing the data we only look at the col1 and col2 columns.
Condition 4) A row in df1 qualifies for replacement if either col1 or col2 is NaN in that row. When a NaN value is detected in a row of df1, the entire row is replaced by the row with the same id value from df2, as long as both col1 and col2 in that df2 row are non-NaN. In other words, if the row with the same id value in df2 has a NaN in either col1 or col2, do not replace any data in df1.
After this operation, df1 should look like the following:
   id   col1   col2
0   1   13.0   23.0
1   2     14     24
2   3  150.0  250.0  # Note that the entire row is replaced!
3   4    NaN    NaN  # This row is not replaced because col2 is NaN in df2 for the same id
How can this be done in the most elegant way? Python offers a lot of functions that I may not be aware of, which may solve this problem in a few lines instead of requiring very complex logic.
You can drop the NaN values from df2, then update with concat and groupby:
pd.concat([df2.dropna(), df1]).groupby('id', as_index=False).first()
Output:
   id   col1   col2
0   1   13.0   23.0
1   2   14.0   24.0
2   3  150.0  250.0
3   4    NaN    NaN
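One caveat with the concat/groupby one-liner: dropna() without arguments also discards df2 rows that have NaN in columns other than col1 and col2, and the df2 rows take precedence for every matching id, even where the df1 row has no NaN at all. The following is a minimal sketch of a stricter variant, assuming the sample data from the question; the names cols, ids_with_nan, donors and result are only illustrative.

```python
import numpy as np
import pandas as pd

# Sample data from the question
df1 = pd.DataFrame({'id': [1, 2, 3, 4],
                    'col1': [13, np.nan, 15, np.nan],
                    'col2': [23, np.nan, np.nan, np.nan]})
df2 = pd.DataFrame({'id': [2, 3, 4],
                    'col1': [14, 150, 16],
                    'col2': [24, 250, np.nan]})

cols = ['col1', 'col2']  # only these columns are checked for NaN

# ids of df1 rows that have at least one NaN in col1/col2
ids_with_nan = df1.loc[df1[cols].isna().any(axis=1), 'id']

# df2 rows that are complete in col1/col2 and whose id needs a replacement
donors = df2.dropna(subset=cols)
donors = donors[donors['id'].isin(ids_with_nan)]

# donor rows come first, so groupby(...).first() prefers their values
result = (pd.concat([donors, df1])
            .groupby('id', as_index=False)
            .first())
print(result)
```

On the sample data this should give the same rows as the one-liner; the difference only shows up when df2 contains a complete row for an id whose df1 row has no NaN, or when other columns contain NaN.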
Here is another way, using fillna:
df1 = df1.set_index('id').fillna(df2.dropna().set_index('id')).reset_index()
Output:
   id  col1   col2
0   1  13.0   23.0
1   2  14.0   24.0
2   3  15.0  250.0
3   4   NaN    NaN
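Note that fillna fills only the individual NaN cells, so the row with id 3 keeps col1 = 15.0, whereas the question asks for the entire row to be replaced (col1 = 150.0). The following is a minimal sketch of a whole-row replacement via id-indexed assignment, assuming the sample data from the question; the names a, b, needs_fix and ids are only illustrative.

```python
import numpy as np
import pandas as pd

# Sample data from the question
df1 = pd.DataFrame({'id': [1, 2, 3, 4],
                    'col1': [13, np.nan, 15, np.nan],
                    'col2': [23, np.nan, np.nan, np.nan]})
df2 = pd.DataFrame({'id': [2, 3, 4],
                    'col1': [14, 150, 16],
                    'col2': [24, 250, np.nan]})

cols = ['col1', 'col2']  # only these columns are checked and replaced

a = df1.set_index('id')
# keep only df2 rows where both col1 and col2 are present
b = df2.set_index('id').dropna(subset=cols)

# rows of df1 with at least one NaN in col1/col2
needs_fix = a[cols].isna().any(axis=1)

# ids that need fixing and also have a complete row in df2
ids = a.index[needs_fix.to_numpy()].intersection(b.index)

# overwrite both columns at once, so the whole row is replaced
a.loc[ids, cols] = b.loc[ids, cols]

result = a.reset_index()
print(result)
```

Because the assignment is restricted to cols, any other columns in df1 are left untouched; whether they should also be copied from df2 is not specified in the question.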