简体   繁体   中英

Find unique values between two columns

I have been going through various questions, but haven't found one that fits to this case.

I have two columns with emails. The first column(CollectedE) consists of 32000 and the second column(UndE) consists of 14987.

I need to find all emails in the second column, which does not exist in the first column and output them into a completely new column.

I have tried something like this, but that doesn't work because of two different lengths in the columns.

import pandas as pd
import numpy as np
df = pd.read_csv('data.csv', delimiter=";")

df['is_dup'] = df[['CollectedE', 'UndE']].duplicated()
df['dups'] = df.groupby(['CollectedE', 'UndE']).is_dup.transform(np.sum)
# df outputs:
df['is_dup'] =[![enter image description here][1]][1] df[['CollectedE', 'UndE']].duplicated()
df['dups'] = df.groupby(['CollectedE', 'UndE'])

df

Here is a picture of the two columns, if that helps. But it seems like all other cases are about either remove duplicates in one column, delete rows with the same values, find frequencies or similar.

在此处输入图片说明

But I hope you can help. Thank you!

也许pandas.Index.difference可以帮助你。

you can use isin which is quite simple with ~ to invert the operation.

df = pd.DataFrame({'CollectedE' : ['abc@gmail.com','random@google.com'],
             'UndE' : ['abc@gmail.com','unique@googlemail.com']})

df['new_col'] = df[~df['CollectedE'].isin(df['UndE'])]['UndE']

print(df)
          CollectedE                   UndE                new_col
0      abc@gmail.com          abc@gmail.com                    NaN
1  random@google.com  unique@googlemail.com  unique@googlemail.com

Here is a working example using the index difference method and a merge.

df = pd.DataFrame({'column_a':['cat','dog','bird','fish','zebra','snake'],
               'column_b':['leopard','snake','bird','sloth','elephant','dolphin']})

idx1 = pd.Index(df['column_a'])
idx2 = pd.Index(df['column_b'])

x = pd.Series(idx2.difference(idx1), name='non_matching_values')

df.merge(x, how='left', left_on='column_b', right_on=x.values)

column_a    column_b    non_matching_values
0   cat leopard leopard
1   dog snake   NaN
2   bird    bird    NaN
3   fish    sloth   sloth
4   zebra   elephant    elephant
5   snake   dolphin dolphin

Here is something I've implemented. I've utilized right outer join and converted output column in a list and appended it in source dataframe.

#Creating dataframe
df = pd.DataFrame({'col1': ['x', 'y', 'z', 'x1'], 'col2': ['x', 'x2', 'y', np.nan]})

#Applying right join and keeping values which are present in 2nd column only
df2 = pd.merge(df[['col1']], df[['col2']], how = 'outer', left_on = ['col1'], right_on 
= ['col2'], indicator = True)

df2 = df2[df2['_merge'] == 'right_only'][['col2']]

To maintain same length of dataframe, null values are added.

#Creating list and adding it as column in source dataframe
df2_list = df2.append(pd.DataFrame({'col2': [np.nan for x in range(len(df) - 
len(df2))]}))['col2'].to_list()

df['col3'] = df2_list

Output:

df
    col1 col2 col3
0    x    x   x2
1    y   x2  NaN
2    z    y  NaN
3   x1  NaN  NaN

You can convert column of list before as well and extend the list with null values.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM