
Pandas drop duplicates ignoring NaN

In a Pandas df, I am trying to drop duplicates across multiple columns. Much of the data in each row is NaN.

This is only an example; the data is a mixed bag, so many different combinations exist.

df.drop_duplicates()

    IDnum       name            formNumber
1   NaN         AP GROUP        028-11964
2   1364615.0   AP GROUP        NaN
3   NaN         AP GROUP        NaN

Desired output:

    IDnum       name            formNumber
1   1364615.0   AP GROUP        028-11964

EDIT:

If the df.drop_duplicates() output looks like this, would it change the solution?

df.drop_duplicates()

    IDnum       name            formNumber
0   NaN         AP GROUP        028-11964
1   1364615.0   AP GROUP        028-11964
2   1364615.0   AP GROUP        NaN
3   NaN         AP GROUP        NaN

You can use groupby + first:

df.groupby('name',as_index=False).first()
Out[206]: 
      name      IDnum formNumber
0  APGROUP  1364615.0  028-11964
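
For completeness, here is a minimal, self-contained sketch of this groupby + first approach, assuming the three-row example DataFrame from the question (reconstructed here for illustration, so the name keeps its space, "AP GROUP"). GroupBy.first() returns the first non-NaN value of each column within a group, which is what collapses the partially filled rows into one:

import numpy as np
import pandas as pd

# Reconstruct the question's example frame (an assumption for illustration).
df = pd.DataFrame({
    "IDnum":      [np.nan, 1364615.0, np.nan],
    "name":       ["AP GROUP", "AP GROUP", "AP GROUP"],
    "formNumber": ["028-11964", np.nan, np.nan],
})

# first() skips NaN, so each group yields one row built from the known values.
result = df.groupby("name", as_index=False).first()
print(result)
#        name      IDnum formNumber
# 0  AP GROUP  1364615.0  028-11964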

You need:

df.bfill().ffill().drop_duplicates()

Output:

    IDnum       name        formNumber
0   1364615.0   AP GROUP    028-11964
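
A short sketch of the same idea, again assuming the question's example frame: bfill()/ffill() propagate the known values into the NaN gaps column by column, after which the rows become identical and drop_duplicates() keeps a single one:

import numpy as np
import pandas as pd

# Same illustrative example frame as above (assumed for illustration).
df = pd.DataFrame({
    "IDnum":      [np.nan, 1364615.0, np.nan],
    "name":       ["AP GROUP", "AP GROUP", "AP GROUP"],
    "formNumber": ["028-11964", np.nan, np.nan],
})

# Back-fill then forward-fill so every NaN is replaced by a value from another
# row of the same column; the now-identical rows collapse to one.
result = df.bfill().ffill().drop_duplicates()
print(result)
#        IDnum      name formNumber
# 0  1364615.0  AP GROUP  028-11964

Note that bfill()/ffill() fill across all rows regardless of grouping, so this is only safe when the whole frame describes a single record; with several entities you would fill within a groupby instead.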

There are multiple ways we can remove duplicates from a dataframe. A few common ways are:

#option 1
df.drop_duplicates()

#option 2
df.groupby(df.columns.tolist()).size()

The major differences between these two options are:

  1. Option 1 considers NaN values. For example, in your case:

     df.drop_duplicates()

         IDnum       name            formNumber
     0   NaN         AP GROUP        028-11964
     1   1364615.0   AP GROUP        028-11964
     2   1364615.0   AP GROUP        NaN
     3   NaN         AP GROUP        NaN

Here indexes 0, 1, 2, and 3 are all unique rows, although duplicates exist in some form.

  2. Option 2 considers only non-NaN values and filters the duplicates, as explained in the first answer by @BENY:

df.groupby('name',as_index=False).first()

      name    IDnum formNumber
0  APGROUP  1364615.0  028-11964

In the case above we see only one unique, non-duplicated row, because the groupby did not consider NaNs.

To better understand this we can do:

df.drop_duplicates().info()
df.groupby(df.columns.tolist(),as_index=False).first().info()

Running the above code gives different counts of "non-null" records, which shows how many nulls the 2nd option ignores compared to the 1st.
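
As a concrete illustration of that count difference, here is a hedged sketch using the four-row frame from the question's edit (reconstructed here as an assumption). Option 1 keeps all four rows because each is unique once NaN is taken into account; option 2 groups on every column and, by default, drops rows that have NaN in any grouping column, so only the fully populated row survives:

import numpy as np
import pandas as pd

# Reconstruct the edited example frame (assumed for illustration).
df = pd.DataFrame({
    "IDnum":      [np.nan, 1364615.0, 1364615.0, np.nan],
    "name":       ["AP GROUP"] * 4,
    "formNumber": ["028-11964", "028-11964", np.nan, np.nan],
})

# Option 1: all 4 rows are kept; IDnum and formNumber each report 2 non-null values.
df.drop_duplicates().info()

# Option 2: rows containing NaN in any grouping column are dropped (dropna=True
# by default), so only the single fully non-null row remains.
df.groupby(df.columns.tolist(), as_index=False).first().info()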
