Find unique values between two columns

I have been going through various questions, but haven't found one that fits this case.

I have two columns with emails. The first column (CollectedE) contains 32000 entries and the second column (UndE) contains 14987.

I need to find all emails in the second column that do not exist in the first column and output them into a completely new column.

I have tried something like this, but it doesn't work because the two columns have different lengths.

import pandas as pd
import numpy as np
df = pd.read_csv('data.csv', delimiter=";")

df['is_dup'] = df[['CollectedE', 'UndE']].duplicated()
df['dups'] = df.groupby(['CollectedE', 'UndE']).is_dup.transform(np.sum)

df

Here is a picture of the two columns, if that helps. It seems that all the other questions are about removing duplicates in one column, deleting rows with the same values, finding frequencies, or similar.

[Image: the two columns]

But I hope you can help. Thank you!

Maybe pandas.Index.difference can help you.
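
As a minimal sketch of that suggestion: the sample addresses and the result name OnlyInUndE below are made up for illustration, and only the column names come from the question. Index.difference returns the values of one index that are absent from another, so the two columns do not need to have the same length.

```python
import pandas as pd

# Hypothetical sample data; the two columns deliberately have different lengths.
collected = pd.Series(['a@x.com', 'b@x.com', 'c@x.com', 'd@x.com'], name='CollectedE')
und = pd.Series(['b@x.com', 'e@x.com', 'f@x.com'], name='UndE')

# Values of UndE that do not appear in CollectedE.
# Index.difference returns a sorted, de-duplicated Index.
only_in_und = pd.Index(und).difference(pd.Index(collected))

# Wrap the result in a Series so it can be saved or attached as a new column.
result = pd.Series(only_in_und, name='OnlyInUndE')
print(result.tolist())
```

Because the result is sorted and de-duplicated, the original row order of UndE is not preserved with this approach.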

You can use isin, which is quite simple, together with ~ to invert the result.

df = pd.DataFrame({'CollectedE': ['abc@gmail.com', 'random@google.com'],
                   'UndE': ['abc@gmail.com', 'unique@googlemail.com']})

# Keep the UndE values that do not appear anywhere in CollectedE
df['new_col'] = df.loc[~df['UndE'].isin(df['CollectedE']), 'UndE']

print(df)
          CollectedE                   UndE                new_col
0      abc@gmail.com          abc@gmail.com                    NaN
1  random@google.com  unique@googlemail.com  unique@googlemail.com
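
The question's columns have different lengths, so after reading the CSV the shorter column is presumably padded with NaN. The following is a rough sketch of how the isin approach could handle that; the sample data and the column name OnlyInUndE are assumptions, not part of the original question.

```python
import numpy as np
import pandas as pd

# Hypothetical data shaped like the question's CSV: UndE is shorter, so it ends in NaN.
df = pd.DataFrame({
    'CollectedE': ['a@x.com', 'b@x.com', 'c@x.com', 'd@x.com'],
    'UndE': ['b@x.com', 'e@x.com', 'f@x.com', np.nan],
})

# Drop the NaN padding before comparing.
und = df['UndE'].dropna()

# Keep the UndE values that are not present anywhere in CollectedE,
# and renumber them from 0 so they sit at the top of the new column.
missing = und[~und.isin(df['CollectedE'])].reset_index(drop=True)

# Assigning a shorter Series aligns on the index; the remaining rows become NaN.
df['OnlyInUndE'] = missing
print(df)
```

Resetting the index is a design choice here: without it, each value would stay on the row where it appeared in UndE instead of being packed at the top.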

Here is a working example using the index difference method and a merge.

df = pd.DataFrame({'column_a':['cat','dog','bird','fish','zebra','snake'],
               'column_b':['leopard','snake','bird','sloth','elephant','dolphin']})

idx1 = pd.Index(df['column_a'])
idx2 = pd.Index(df['column_b'])

x = pd.Series(idx2.difference(idx1), name='non_matching_values')

df.merge(x, how='left', left_on='column_b', right_on=x.values)

  column_a  column_b non_matching_values
0      cat   leopard             leopard
1      dog     snake                 NaN
2     bird      bird                 NaN
3     fish     sloth               sloth
4    zebra  elephant            elephant
5    snake   dolphin             dolphin

Here is something I've implemented. I used an outer join with an indicator, kept the rows found only in the second column, converted that output column to a list, and added it to the source dataframe.

import numpy as np
import pandas as pd

# Creating the dataframe
df = pd.DataFrame({'col1': ['x', 'y', 'z', 'x1'], 'col2': ['x', 'x2', 'y', np.nan]})

# Outer join with an indicator, keeping the values that are present in the 2nd column only
df2 = pd.merge(df[['col1']], df[['col2']], how='outer',
               left_on=['col1'], right_on=['col2'], indicator=True)

df2 = df2[df2['_merge'] == 'right_only'][['col2']]

To keep the dataframe the same length, null values are added.

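# Pad with NaN to match the source dataframe's length, then add the result as a new column
# (DataFrame.append was removed in pandas 2.0, so pd.concat is used instead)
padding = pd.DataFrame({'col2': [np.nan] * (len(df) - len(df2))})
df['col3'] = pd.concat([df2, padding])['col2'].to_list()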

Output:

df
    col1 col2 col3
0    x    x   x2
1    y   x2  NaN
2    z    y  NaN
3   x1  NaN  NaN

You can also convert the column to a list first and then extend the list with null values.
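
A rough sketch of that list-based variant, reusing the same sample dataframe. Note that it swaps the merge for an isin filter to keep the example short, which is an assumption of this sketch rather than the approach shown above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': ['x', 'y', 'z', 'x1'], 'col2': ['x', 'x2', 'y', np.nan]})

# Collect the col2 values that are absent from col1, skipping nulls.
mask = ~df['col2'].isin(df['col1']) & df['col2'].notna()
values = df.loc[mask, 'col2'].to_list()

# Extend the list with NaN so its length matches the dataframe.
values.extend([np.nan] * (len(df) - len(values)))

df['col3'] = values
print(df)
```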
