
Fastest way of finding duplicates in pandas

I have a dataframe like this:我有一个这样的数据框:

date             IP                 date_2            IP_2
2020-02-17       81.195.104.48      2020-02-24        219.85.238.142
2020-02-17       83.71.247.175      2020-02-24        187.134.23.124
2020-02-17       83.71.247.175      NaT               NaN

I am trying to find duplicates when comparing the IP and IP_2 values. IP has more rows than IP_2, hence I am checking whether each IP_2 value exists in IP, like so:

df['duplicates']=df['IP_2'].isin(df['IP'])

Is there a faster way of getting only the duplicated rows, rather than adding a new column from the .isin() check? The desired output is a new dataframe holding only the duplicated values.
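For reference, the boolean Series from .isin() can be used directly as a filter mask, so no helper column is needed. A minimal sketch (the values here are illustrative, not the question's exact data, so that one duplicate actually exists):

```python
import pandas as pd

# Toy frame mirroring the question's shape; values are illustrative
df = pd.DataFrame({
    "IP":   ["81.195.104.48", "83.71.247.175", "83.71.247.175"],
    "IP_2": ["219.85.238.142", "83.71.247.175", None],
})

# Filter in one step: keep only rows whose IP_2 also appears in IP
dups = df[df["IP_2"].isin(df["IP"])]
```

Here `dups` is a new dataframe containing only the duplicated rows, which matches the desired output.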

Thank you for your suggestions.

Set comparison seems to me the fastest way:

set_common = set(df['IP']) & set(df['IP_2'])
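If you then need the duplicated rows themselves (not just the set of common values), you can feed the intersection back into an .isin() filter. A small sketch with illustrative data:

```python
import pandas as pd

# Toy frame; values are illustrative, chosen so one value is shared
df = pd.DataFrame({
    "IP":   ["81.195.104.48", "83.71.247.175"],
    "IP_2": ["83.71.247.175", "187.134.23.124"],
})

# Values present in both columns
set_common = set(df["IP"]) & set(df["IP_2"])

# Rows whose IP_2 is one of the common values
duplicated = df[df["IP_2"].isin(set_common)]
```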

PS Another way is to play with the IP format itself (i.e. turn each address into an integer and then do the comparison on integers), but this would presumably only pay off for a very big table.
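The integer conversion mentioned above can be done with the standard-library ipaddress module; the helper name below is my own, not from the original answer:

```python
import ipaddress

# Hypothetical helper: convert a dotted-quad string to its integer value,
# so that lookups and comparisons operate on ints instead of strings
def ip_to_int(s):
    return int(ipaddress.ip_address(s))

# e.g. "0.0.0.1" maps to 1, "1.0.0.0" maps to 256**3
```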
