Pandas drop duplicates ignoring NaN
In a Pandas df, I am trying to drop duplicates across multiple columns. Lots of data per row is NaN.
This is only an example; the data is a mixed bag, so many different combinations exist.
df.drop_duplicates()
   IDnum      name      formNumber
1  NaN        AP GROUP  028-11964
2  1364615.0  AP GROUP  NaN
3  NaN        AP GROUP  NaN
Hopeful output:
   IDnum      name      formNumber
1  1364615.0  AP GROUP  028-11964
EDIT:
If the df.drop_duplicates() output looks like this, would it change the solution?
df.drop_duplicates()
   IDnum      name      formNumber
0  NaN        AP GROUP  028-11964
1  1364615.0  AP GROUP  028-11964
2  1364615.0  AP GROUP  NaN
3  NaN        AP GROUP  NaN
You can use groupby + first:
df.groupby('name',as_index=False).first()
Out[206]:
   name      IDnum      formNumber
0  APGROUP   1364615.0  028-11964
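For anyone who wants to reproduce this, here is a minimal sketch. It assumes the DataFrame matches the EDIT table in the question; the construction below is my reconstruction, not the asker's actual data. It relies on first() returning the first non-null value of each column within a group.

```python
import numpy as np
import pandas as pd

# Sample data reconstructed from the question's EDIT table (an assumption).
df = pd.DataFrame(
    {
        "IDnum": [np.nan, 1364615.0, 1364615.0, np.nan],
        "name": ["AP GROUP"] * 4,
        "formNumber": ["028-11964", "028-11964", np.nan, np.nan],
    }
)

# first() picks the first non-null value per column inside each group,
# so the partial rows for one name collapse into a single complete row.
result = df.groupby("name", as_index=False).first()
print(result)
```

Note that this merges every row sharing the same name, so it only fits if name identifies the entity you want to collapse.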
You need:
df.bfill().ffill().drop_duplicates()
Output:
   IDnum      name      formNumber
0  1364615.0  AP GROUP  028-11964
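One thing to watch: bfill() and ffill() on the whole frame copy values between neighbouring rows regardless of name, which is fine when every row belongs to the same entity. If the real data contains several names, a possible variation is to fill within each name only. The sketch below uses hypothetical data (the "OTHER CO" rows are invented for illustration) and is not part of the original answer.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two different names, each with partial rows.
df = pd.DataFrame(
    {
        "IDnum": [np.nan, 1364615.0, np.nan, 42.0],
        "name": ["AP GROUP", "AP GROUP", "OTHER CO", "OTHER CO"],
        "formNumber": ["028-11964", np.nan, np.nan, np.nan],
    }
)

value_cols = ["IDnum", "formNumber"]
filled = df.copy()

# Back-fill then forward-fill each column, but only among rows with the same name.
filled[value_cols] = df.groupby("name")[value_cols].transform(
    lambda s: s.bfill().ffill()
)

# Rows within a name are now identical where data existed, so they dedupe cleanly.
deduped = filled.drop_duplicates().reset_index(drop=True)
print(deduped)
```

With this data, "OTHER CO" keeps a missing formNumber instead of inheriting the one from "AP GROUP".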
There are multiple ways to remove duplicates from a DataFrame. Two common ones are:
#option 1
df.drop_duplicates()
#option 2
df.groupby(df.columns.tolist()).size()
The major difference between these two options is:
Option 1 takes NaN values into account when comparing rows. For example, in your case:
df.drop_duplicates()
   IDnum      name      formNumber
0  NaN        AP GROUP  028-11964
1  1364615.0  AP GROUP  028-11964
2  1364615.0  AP GROUP  NaN
3  NaN        AP GROUP  NaN
Here, rows 0, 1, 2 and 3 all count as unique, even though they partially duplicate one another.
df.groupby('name',as_index=False).first()
   name      IDnum      formNumber
0  APGROUP   1364615.0  028-11964
In the above case we get a single, non-duplicated row, because groupby(...).first() skips NaN and returns the first non-null value in each column.
To see this more clearly, we can run:
df.drop_duplicates().info()
df.groupby(df.columns.tolist(),as_index=False).first().info()
Running the above code gives different "non-null" counts for the two options. The difference shows how many nulls the second option ignores compared with the first.
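As a rough illustration of that comparison, the sketch below rebuilds the question's EDIT table (my reconstruction, so treat the data as an assumption) and counts what each option keeps. The dropna=False variant is an addition of mine, not part of the answer above; it assumes pandas 1.1 or later, where groupby accepts that parameter.

```python
import numpy as np
import pandas as pd

# Sample data reconstructed from the question's EDIT table (an assumption).
df = pd.DataFrame(
    {
        "IDnum": [np.nan, 1364615.0, 1364615.0, np.nan],
        "name": ["AP GROUP"] * 4,
        "formNumber": ["028-11964", "028-11964", np.nan, np.nan],
    }
)

all_cols = df.columns.tolist()

# Option 1: NaN is compared like any other value, so all four rows differ.
n_drop = len(df.drop_duplicates())

# Option 2: by default groupby discards rows whose key contains NaN.
n_group = len(df.groupby(all_cols).size())

# Variation (not in the original answer): keep NaN keys as their own groups.
n_group_keep_nan = len(df.groupby(all_cols, dropna=False).size())

print(n_drop, n_group, n_group_keep_nan)
```

Only the one fully populated row survives the default groupby here, which is why the two .info() calls report different counts.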