How to detect duplicates and then cross-check whether two columns share similar values?

Question

So I have a data frame like this:

 No    fname        sname        landline        address
 1   Alphred      Thomas         123              A
 2   Peter        Jay            345              B
 3   Donald       Hook           123              A
 4   Jay          Donald         345              B
 5   Jay          Donald         123              A
 6   Haskell      Peter          123              B

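Now I want to group together all rows that are duplicates on landline and address. In the example above, the group (123, A) is one set of duplicate entities and (345, B) is another. I want to ignore (123, B) because it occurs only once.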
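The first step described above, flagging every row whose (landline, address) pair occurs more than once, can be sketched with `DataFrame.duplicated`. This is a minimal sketch that assumes pandas is available and that the columns are named as in the table; `dup_mask` and `dups` are illustrative names, not part of the original post.

```python
import pandas as pd

# Sample data copied from the table in the question.
df = pd.DataFrame({
    "No": [1, 2, 3, 4, 5, 6],
    "fname": ["Alphred", "Peter", "Donald", "Jay", "Jay", "Haskell"],
    "sname": ["Thomas", "Jay", "Hook", "Donald", "Donald", "Peter"],
    "landline": [123, 345, 123, 345, 123, 123],
    "address": ["A", "B", "A", "B", "A", "B"],
})

# keep=False marks every member of a repeated (landline, address) pair,
# not only the second and later occurrences.
dup_mask = df.duplicated(subset=["landline", "address"], keep=False)
dups = df[dup_mask]
print(dups)
```

Row 6, the only (123, B) entry, is dropped; the remaining rows still need the per-group name comparison.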

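Now, for each group of duplicates, I want to check whether a name appears in both the fname and sname columns. So for (123, A) we want to capture the rows where Donald appears in both fname and sname (basically these must be two different rows whose two columns share a name); in the example above we would pick rows 3 and 5. After selecting these rows, I want to perform some further operations on them, such as checking the date on which the names were entered.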

How can I achieve this? I tried using duplicated, but that doesn't help much with the second comparison.

Answer 1

You can use groupby with isin to build a mask, and then apply boolean indexing:

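# df is the data frame from the question.
# & binds more tightly than |, so the OR is parenthesized before the group-size check.
mask = df.groupby(['landline','address']).apply(lambda x: (x.fname.isin(x.sname) |
                                                           x.sname.isin(x.fname)) &
                                                            (len(x) > 1))
mask = mask.reset_index(level=['landline','address'], drop=True).sort_index()
print(mask)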
0    False
1     True
2     True
3     True
4     True
5    False
dtype: bool

df1 = df[mask]
print (df1)
   No   fname   sname  landline address
1   2   Peter     Jay       345       B
2   3  Donald    Hook       123       A
3   4     Jay  Donald       345       B
4   5     Jay  Donald       123       A
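One detail worth checking in a mask of this shape is how the `|` and `&` terms are grouped, since Python gives `&` higher precedence than `|`. The sketch below uses a hypothetical single-row group (the name "Sam" is made up for illustration) whose fname equals its own sname, to show that the group-size check only applies to the whole condition when the OR is parenthesized.

```python
import pandas as pd

# Hypothetical single-row group in which fname equals its own sname.
g = pd.DataFrame({"fname": ["Sam"], "sname": ["Sam"]})

# Parsed as: fname.isin(sname) | (sname.isin(fname) & (len(g) > 1))
unparenthesized = g.fname.isin(g.sname) | g.sname.isin(g.fname) & (len(g) > 1)

# The size check now guards both isin tests.
parenthesized = (g.fname.isin(g.sname) | g.sname.isin(g.fname)) & (len(g) > 1)

print(unparenthesized.tolist())  # [True]
print(parenthesized.tolist())    # [False]
```

On the sample data in the question both forms select the same rows, because the only single-row group, (123, B), has different first and last names.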

Edit: I think you can use a custom function with filtering:

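def f(x):
    # Keep rows whose fname or sname also appears in the other column,
    # but only within groups that contain more than one row.
    mask = (x.fname.isin(x.sname) | x.sname.isin(x.fname)) & (len(x) > 1)
    return x[mask]


df2 = df.groupby(['landline','address']).apply(f).reset_index(drop=True)
print(df2)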
   No   fname   sname  landline address
0   3  Donald    Hook       123       A
1   5     Jay  Donald       123       A
2   2   Peter     Jay       345       B
3   4     Jay  Donald       345       B
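The question asks for matches between two different rows, which the isin test does not enforce by itself: a row whose fname equals its own sname would also pass inside a multi-row group. As an alternative, and not the approach used in this answer, here is a minimal sketch based on a self-merge. It assumes that `No` uniquely identifies each row; `keys`, `left`, `right`, `pairs`, `matched` and `result` are illustrative names.

```python
import pandas as pd

# Sample data copied from the table in the question.
df = pd.DataFrame({
    "No": [1, 2, 3, 4, 5, 6],
    "fname": ["Alphred", "Peter", "Donald", "Jay", "Jay", "Haskell"],
    "sname": ["Thomas", "Jay", "Hook", "Donald", "Donald", "Peter"],
    "landline": [123, 345, 123, 345, 123, 123],
    "address": ["A", "B", "A", "B", "A", "B"],
})

keys = ["landline", "address"]

# One frame keyed by first name, one by last name, sharing a "name" column.
left = df[keys + ["No", "fname"]].rename(columns={"No": "No_f", "fname": "name"})
right = df[keys + ["No", "sname"]].rename(columns={"No": "No_s", "sname": "name"})

# Inner join: a row's fname must equal some row's sname within the same
# (landline, address) group.
pairs = left.merge(right, on=keys + ["name"])

# Drop self-matches so the two occurrences come from different rows.
pairs = pairs[pairs["No_f"] != pairs["No_s"]]

# Collect the row identifiers from both sides of every surviving pair.
matched = sorted(set(pairs["No_f"]) | set(pairs["No_s"]))
result = df[df["No"].isin(matched)]
print(result)
```

On the sample data this selects the same rows as the groupby version (No 2, 3, 4 and 5). The `pairs` frame additionally records which row matched which, which may help with follow-up checks.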

Solution 1
2 Accepted 2017-02-27 10:13:45
