简体   繁体   中英

Python/Pandas: Identify Duplicates across Columns

In the following code I would like to identify and report values in Col1 that appear in Col2, values in Col2 that appear in Col1 and overall values that appear more than once.

In the example below values AAPL and GOOG appear in Col1 and Col2. These are expected to be identified and reported in next 2 columns, and in the column after that expecting to identify and report whether "any" of Col1 or Col2 values are DUP.

import pandas as pd
import numpy as np
data={'Col1':['AAPL', np.nan, 'GOOG', 'MMM', np.nan, 'INTC', 'FB'],'Col2':['GOOG', 'IBM', 'MSFT', np.nan, 'GOOG', 'AAPL', 'VZ']}
df=pd.DataFrame(data,columns=['Col1','Col2'])
print (df)

# How to code after this to produce expected result?
# Appreciate any hint/help provided

这就是结果在Excel中的显示方式

Here is a solution for you that works with the code above. It just uses some for loops with itterows(). Nothing fancy.

df['Col3'] = False
df['Col4'] = False
df['Col5'] = False

for i,row in df.iterrows():
  if df.loc[i,'Col1'] in (df.Col2.values):
     df.loc[i,'Col3'] = True

for i,row in df.iterrows():
  if df.loc[i,'Col2'] in (df.Col1.values):
     df.loc[i,'Col4'] = True

for i,row in df.iterrows():
  if df.loc[i,'Col3'] | df.loc[i,'Col4'] == True:
     df.loc[i,'Col5'] = True

Click here to view image of result

Use numpy where to check if one column value is in another, and then boolean OR the columns to check if it's a dupe.

df['Col1inCol2']=np.where(df.Col1.isin(df.Col2) & ~df.Col1.isnull(), True, False)
df['Col2inCol1']=np.where(df.Col2.isin(df.Col1) & ~df.Col2.isnull(), True, False)
df['Dupe']= df.Col1inCol2 | df.Col2inCol1



    Col1    Col2    Col1inCol2  Col2inCol1  Dupe
0   AAPL    GOOG    True            True    True
1   NaN     IBM     False           False   False
2   GOOG    MSFT    True            False   True
3   MMM     NaN     False           False   False
4   NaN     GOOG    False           True    True
5   INTC    AAPL    False           True    True
6   FB       VZ     False           False   False

Following is the final script:

##############################################################################
# Code to identify and report duplicates across columns
# np.nan values are handled
# Date: 04-JUL-2018
# Posted by: Salil V Gangal
# Forum: Stack OverFlow
##############################################################################

import pandas as pd
import numpy as np
data={'Col1':['AAPL', np.nan, 'GOOG', 'MMM', np.nan, 'INTC', 'FB'],'Col2':['GOOG', 'IBM', 'MSFT', np.nan, 'GOOG', 'AAPL', 'VZ']}
df=pd.DataFrame(data,columns=['Col1','Col2'])
print ("Initial DataFrame\n")
print (df)

pd.set_option("display.max_rows",999)
pd.set_option("display.max_columns",999)


df['Col1_val_exists_in_Col2'] = False
df['Col2_val_exists_in_Col1'] = False
df['Dup_in_Frame'] = False

for i,row in df.iterrows():
  if df.loc[i,'Col1'] in (df.Col2.values):
     df.loc[i,'Col1_val_exists_in_Col2'] = True

for i,row in df.iterrows():
  if df.loc[i,'Col2'] in (df.Col1.values):
     df.loc[i,'Col2_val_exists_in_Col1'] = True

for i,row in df.iterrows():
  if df.loc[i,'Col1_val_exists_in_Col2'] | df.loc[i,'Col2_val_exists_in_Col1'] == True:
     df.loc[i,'Dup_in_Frame'] = True

print ("Final DataFrame\n")
print (df)

Another way of doing the task is given below - thanks to "skrubber":

##############################################################################
# Code to identify and report duplicates across columns
# np.nan values are handled
# Date: 05-JUL-2018
# Posted by: Salil V Gangal
# Forum: Stack OverFlow
##############################################################################

import pandas as pd
import numpy as np
data={ 
       'Col1':
              ['AAPL', np.nan, 'GOOG', 'MMM', np.nan, 'INTC', 'FB'],
       'Col2':
              ['GOOG', 'IBM', 'MSFT', np.nan, 'GOOG', 'AAPL', 'VZ']
     }
df=pd.DataFrame(data,columns=['Col1','Col2'])
print ("\n\nInitial DataFrame\n")
print (df)

pd.set_option("display.max_rows",999)
pd.set_option("display.max_columns",999)

df['Col1_val_exists_in_Col2'] = np.where(df.Col1.isin(df.Col2) & ~df.Col1.isnull(), True, False)
df['Col2_val_exists_in_Col1'] = np.where(df.Col2.isin(df.Col1) & ~df.Col2.isnull(), True, False)
df['Dupe'] = df.Col1_val_exists_in_Col2 | df.Col2_val_exists_in_Col1


print ("\n\nFinal DataFrame\n")
print (df)


Initial DataFrame

   Col1  Col2
0  AAPL  GOOG
1   NaN   IBM
2  GOOG  MSFT
3   MMM   NaN
4   NaN  GOOG
5  INTC  AAPL
6    FB    VZ


Final DataFrame

   Col1  Col2  Col1_val_exists_in_Col2  Col2_val_exists_in_Col1   Dupe
0  AAPL  GOOG                     True                     True   True
1   NaN   IBM                    False                    False  False
2  GOOG  MSFT                     True                    False   True
3   MMM   NaN                    False                    False  False
4   NaN  GOOG                    False                     True   True
5  INTC  AAPL                    False                     True   True
6    FB    VZ                    False                    False  False

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM