
Pandas drop duplicates ignoring NaN

In a Pandas df, I am trying to drop duplicates across multiple columns. Much of the data in each row is NaN.

This is only an example; the data is a mixed bag, so many different combinations exist.

df.drop_duplicates()

    IDnum       name            formNumber
1   NaN         AP GROUP        028-11964
2   1364615.0   AP GROUP        NaN
3   NaN         AP GROUP        NaN

Desired output:

    IDnum       name            formNumber
1   1364615.0   AP GROUP        028-11964

EDIT:

If the df.drop_duplicates() output looks like this, would it change the solution?

df.drop_duplicates()

    IDnum       name            formNumber
0   NaN         AP GROUP        028-11964
1   1364615.0   AP GROUP        028-11964
2   1364615.0   AP GROUP        NaN
3   NaN         AP GROUP        NaN

You can use groupby + first:

df.groupby('name',as_index=False).first()
Out[206]: 
      name      IDnum formNumber
0  APGROUP  1364615.0  028-11964
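
For completeness, here is a minimal, self-contained sketch of this groupby + first approach, assuming the three-row example DataFrame from the question (reconstructed here for illustration, so the name keeps its space, "AP GROUP"). GroupBy.first() returns the first non-NaN value of each column within a group, which is what collapses the partially filled rows into one:

import numpy as np
import pandas as pd

# Reconstruct the question's example frame (an assumption for illustration).
df = pd.DataFrame({
    "IDnum":      [np.nan, 1364615.0, np.nan],
    "name":       ["AP GROUP", "AP GROUP", "AP GROUP"],
    "formNumber": ["028-11964", np.nan, np.nan],
})

# first() skips NaN, so each group yields one row built from the known values.
result = df.groupby("name", as_index=False).first()
print(result)
#        name      IDnum formNumber
# 0  AP GROUP  1364615.0  028-11964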

You need:

df.bfill().ffill().drop_duplicates()

Output:

    IDnum       name        formNumber
0   1364615.0   AP GROUP    028-11964
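
A short sketch of the same idea, again assuming the question's example frame: bfill()/ffill() propagate the known values into the NaN gaps column by column, after which the rows become identical and drop_duplicates() keeps a single one:

import numpy as np
import pandas as pd

# Same illustrative example frame as above (assumed for illustration).
df = pd.DataFrame({
    "IDnum":      [np.nan, 1364615.0, np.nan],
    "name":       ["AP GROUP", "AP GROUP", "AP GROUP"],
    "formNumber": ["028-11964", np.nan, np.nan],
})

# Back-fill then forward-fill so every NaN is replaced by a value from another
# row of the same column; the now-identical rows collapse to one.
result = df.bfill().ffill().drop_duplicates()
print(result)
#        IDnum      name formNumber
# 0  1364615.0  AP GROUP  028-11964

Note that bfill()/ffill() fill across all rows regardless of grouping, so this is only safe when the whole frame describes a single record; with several entities you would fill within a groupby instead.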

There are multiple ways we can remove duplicates from a dataframe. A few common ways are:

#option 1
df.drop_duplicates()

#option 2
df.groupby(df.columns.tolist()).size()

The major differences between these two options are:

  1. Option 1 considers NaN values. For example, in your case:

     df.drop_duplicates()

         IDnum       name            formNumber
     0   NaN         AP GROUP        028-11964
     1   1364615.0   AP GROUP        028-11964
     2   1364615.0   AP GROUP        NaN
     3   NaN         AP GROUP        NaN

Here indexes 0, 1, 2, and 3 are all unique rows, although duplicates exist in some form.

  2. Option 2 considers only non-NaN values and filters the duplicates, as explained in the first answer by @BENY:

df.groupby('name',as_index=False).first()

      name    IDnum formNumber
0  APGROUP  1364615.0  028-11964

In the case above we see only one unique, non-duplicated row, because the groupby did not consider NaNs.

To better understand this we can do:

df.drop_duplicates().info()
df.groupby(df.columns.tolist(),as_index=False).first().info()

Running the above code gives different counts of "non-null" records, which shows how many nulls the 2nd option ignores compared to the 1st.
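
As a concrete illustration of that count difference, here is a hedged sketch using the four-row frame from the question's edit (reconstructed here as an assumption). Option 1 keeps all four rows because each is unique once NaN is taken into account; option 2 groups on every column and, by default, drops rows that have NaN in any grouping column, so only the fully populated row survives:

import numpy as np
import pandas as pd

# Reconstruct the edited example frame (assumed for illustration).
df = pd.DataFrame({
    "IDnum":      [np.nan, 1364615.0, 1364615.0, np.nan],
    "name":       ["AP GROUP"] * 4,
    "formNumber": ["028-11964", "028-11964", np.nan, np.nan],
})

# Option 1: all 4 rows are kept; IDnum and formNumber each report 2 non-null values.
df.drop_duplicates().info()

# Option 2: rows containing NaN in any grouping column are dropped (dropna=True
# by default), so only the single fully non-null row remains.
df.groupby(df.columns.tolist(), as_index=False).first().info()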
