Merge rows in a Pandas DataFrame, filling NaN values and removing duplicates
I'm trying to clean a Python Pandas DataFrame that contains dirty data with "repeated" (but not exactly duplicated) information about people.
id name name2 name3 email
1 A A A email@gmail.com
1 A NaN NaN NaN
NaN A A B email@gmail.com
NaN A A B email@gmail.com
1 A A B NaN
NaN A A A email@gmail.com
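For anyone who wants to reproduce this, the table above can be built roughly as follows. This is a minimal sketch: it assumes the missing cells are real NaN values (not the string "NaN") and that id is numeric, neither of which is stated explicitly in the post.

```python
import numpy as np
import pandas as pd

# Sample data from the table above; np.nan marks the missing cells (assumption).
df = pd.DataFrame(
    {
        "id": [1, 1, np.nan, np.nan, 1, np.nan],
        "name": ["A", "A", "A", "A", "A", "A"],
        "name2": ["A", np.nan, "A", "A", "A", "A"],
        "name3": ["A", np.nan, "B", "B", "B", "A"],
        "email": [
            "email@gmail.com",
            np.nan,
            "email@gmail.com",
            "email@gmail.com",
            np.nan,
            "email@gmail.com",
        ],
    }
)

print(df)
```

Because id contains NaN, pandas stores that column as float, which is why it prints as 1.0 rather than 1.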
Unfortunately I don't have a clear "primary key", since the id column is not always set and I have several name columns (name, name2, name3) that don't always match (sometimes rows have the same name but a different name2). I'd like to keep all of this information, while removing duplicate rows and "merging" rows so as to eliminate as many NaN values as possible, without losing any kind of information.
The output should be:
id name name2 name3 email
1 A A A email@gmail.com
1 A A B email@gmail.com
The second row is given by the merge between
NaN A A B email@gmail.com
1 A A B NaN
in the original dataframe.
(I already tried the solution here: How can I merge duplicate rows and fill the NaN cells with the values from the other row? but without success.)
Thanks.
Maybe the example is unclear, but IIUC, ffill and drop_duplicates:
out = df.ffill().drop_duplicates()
Output:
id name name2 name3 email
0 1.0 A A A email@gmail.com
2 1.0 A A B email@gmail.com
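One caveat worth noting: ffill copies values down from whatever row happens to sit above, so on data where neighbouring rows belong to different people it could fill a NaN with someone else's value. As a hedged alternative (not the method from the answer above), the sketch below groups on the three name columns and takes the first non-null value per column. It assumes that rows with identical name, name2 and name3 describe the same record, and that rows with a NaN in any of those columns can be discarded, since groupby drops NaN keys by default. Both assumptions hold for the sample but may not hold for the real data.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the question; np.nan for missing cells (assumption).
df = pd.DataFrame(
    {
        "id": [1, 1, np.nan, np.nan, 1, np.nan],
        "name": ["A", "A", "A", "A", "A", "A"],
        "name2": ["A", np.nan, "A", "A", "A", "A"],
        "name3": ["A", np.nan, "B", "B", "B", "A"],
        "email": ["email@gmail.com", np.nan, "email@gmail.com",
                  "email@gmail.com", np.nan, "email@gmail.com"],
    }
)

# Approach from the answer: forward-fill, then drop exact duplicates.
filled = df.ffill().drop_duplicates()

# Alternative sketch: treat (name, name2, name3) as the key (assumption) and
# keep the first non-null value of every other column within each group.
# Rows whose key contains NaN are dropped by groupby's default dropna=True.
key_cols = ["name", "name2", "name3"]
merged = df.groupby(key_cols, as_index=False).first()

# Restore the original column order.
merged = merged[list(df.columns)]

print(filled)
print(merged)
```

On the sample data both approaches end up with the same two records. They would differ on data where a row with missing values follows a row for an unrelated person.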