In the following code I would like to identify and report values in Col1 that also appear in Col2, values in Col2 that also appear in Col1, and, overall, values that appear more than once.
In the example below, the values AAPL and GOOG appear in both Col1 and Col2. I expect these to be flagged in the next two columns, and a further column should report whether either the Col1 or the Col2 value in that row is a duplicate.
import pandas as pd
import numpy as np
data={'Col1':['AAPL', np.nan, 'GOOG', 'MMM', np.nan, 'INTC', 'FB'],'Col2':['GOOG', 'IBM', 'MSFT', np.nan, 'GOOG', 'AAPL', 'VZ']}
df=pd.DataFrame(data,columns=['Col1','Col2'])
print (df)
# How to code after this to produce expected result?
# Appreciate any hint/help provided
Here is a solution that works with the code above. It just uses a few for loops with iterrows(). Nothing fancy.
df['Col3'] = False
df['Col4'] = False
df['Col5'] = False
for i,row in df.iterrows():
    if df.loc[i,'Col1'] in (df.Col2.values):
        df.loc[i,'Col3'] = True
for i,row in df.iterrows():
    if df.loc[i,'Col2'] in (df.Col1.values):
        df.loc[i,'Col4'] = True
for i,row in df.iterrows():
    if df.loc[i,'Col3'] | df.loc[i,'Col4'] == True:
        df.loc[i,'Col5'] = True
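On larger frames the row-by-row loops above can get slow, because every `in` test scans a whole column. A minimal sketch of the same idea using Python sets follows. The names `col1_values` and `col2_values` are made up for illustration, and the sets are built after `dropna()` on the assumption that a missing value should never count as a match.

```python
import numpy as np
import pandas as pd

data = {'Col1': ['AAPL', np.nan, 'GOOG', 'MMM', np.nan, 'INTC', 'FB'],
        'Col2': ['GOOG', 'IBM', 'MSFT', np.nan, 'GOOG', 'AAPL', 'VZ']}
df = pd.DataFrame(data, columns=['Col1', 'Col2'])

# Build each lookup set once; dropna() keeps NaN out of the membership test.
col1_values = set(df['Col1'].dropna())
col2_values = set(df['Col2'].dropna())

# One pass per column instead of one column scan per row.
df['Col3'] = [value in col2_values for value in df['Col1']]
df['Col4'] = [value in col1_values for value in df['Col2']]
df['Col5'] = df['Col3'] | df['Col4']

print(df)
```

The column names Col3, Col4 and Col5 are kept from the loop version so the two results can be compared directly.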
Use numpy where to check whether a value in one column appears in the other, then boolean-OR the two result columns to flag a dupe. The ~isnull() term keeps NaN from being counted as a match.
df['Col1inCol2']=np.where(df.Col1.isin(df.Col2) & ~df.Col1.isnull(), True, False)
df['Col2inCol1']=np.where(df.Col2.isin(df.Col1) & ~df.Col2.isnull(), True, False)
df['Dupe']= df.Col1inCol2 | df.Col2inCol1
   Col1  Col2  Col1inCol2  Col2inCol1   Dupe
0  AAPL  GOOG        True        True   True
1   NaN   IBM       False       False  False
2  GOOG  MSFT        True       False   True
3   MMM   NaN       False       False  False
4   NaN  GOOG       False        True   True
5  INTC  AAPL       False        True   True
6    FB    VZ       False       False  False
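Since `isin()` already returns a boolean Series, the `np.where(..., True, False)` wrapper is optional. The sketch below is one possible way to package the same check as a reusable function; `flag_cross_column_dupes` and its parameter names are hypothetical, not part of pandas. The explicit `notna()` mask is there because `isin()` treats NaN as equal to NaN, so without it the NaN rows would be flagged whenever the other column also contains a NaN.

```python
import numpy as np
import pandas as pd


def flag_cross_column_dupes(frame, left, right):
    """Return a copy of frame with three boolean flag columns added.

    Hypothetical helper for illustration only.
    """
    out = frame.copy()
    left_flag = f'{left}in{right}'
    right_flag = f'{right}in{left}'
    # isin() matches NaN with NaN, so mask missing values explicitly.
    out[left_flag] = out[left].isin(out[right]) & out[left].notna()
    out[right_flag] = out[right].isin(out[left]) & out[right].notna()
    # A row is a dupe if either of its values shows up in the other column.
    out['Dupe'] = out[left_flag] | out[right_flag]
    return out


data = {'Col1': ['AAPL', np.nan, 'GOOG', 'MMM', np.nan, 'INTC', 'FB'],
        'Col2': ['GOOG', 'IBM', 'MSFT', np.nan, 'GOOG', 'AAPL', 'VZ']}
df = pd.DataFrame(data, columns=['Col1', 'Col2'])

flagged = flag_cross_column_dupes(df, 'Col1', 'Col2')
print(flagged)
```

Working on a copy leaves the original frame untouched, which makes it easier to try the function with different column pairs.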
Following is the final script:
##############################################################################
# Code to identify and report duplicates across columns
# np.nan values are handled
# Date: 04-JUL-2018
# Posted by: Salil V Gangal
# Forum: Stack Overflow
##############################################################################
import pandas as pd
import numpy as np
data={'Col1':['AAPL', np.nan, 'GOOG', 'MMM', np.nan, 'INTC', 'FB'],'Col2':['GOOG', 'IBM', 'MSFT', np.nan, 'GOOG', 'AAPL', 'VZ']}
df=pd.DataFrame(data,columns=['Col1','Col2'])
print ("Initial DataFrame\n")
print (df)
pd.set_option("display.max_rows",999)
pd.set_option("display.max_columns",999)
df['Col1_val_exists_in_Col2'] = False
df['Col2_val_exists_in_Col1'] = False
df['Dup_in_Frame'] = False
for i,row in df.iterrows():
    if df.loc[i,'Col1'] in (df.Col2.values):
        df.loc[i,'Col1_val_exists_in_Col2'] = True
for i,row in df.iterrows():
    if df.loc[i,'Col2'] in (df.Col1.values):
        df.loc[i,'Col2_val_exists_in_Col1'] = True
for i,row in df.iterrows():
    if df.loc[i,'Col1_val_exists_in_Col2'] | df.loc[i,'Col2_val_exists_in_Col1'] == True:
        df.loc[i,'Dup_in_Frame'] = True
print ("Final DataFrame\n")
print (df)
Another way of doing the task is given below - thanks to "skrubber":
##############################################################################
# Code to identify and report duplicates across columns
# np.nan values are handled
# Date: 05-JUL-2018
# Posted by: Salil V Gangal
# Forum: Stack Overflow
##############################################################################
import pandas as pd
import numpy as np
data={
    'Col1':
        ['AAPL', np.nan, 'GOOG', 'MMM', np.nan, 'INTC', 'FB'],
    'Col2':
        ['GOOG', 'IBM', 'MSFT', np.nan, 'GOOG', 'AAPL', 'VZ']
}
df=pd.DataFrame(data,columns=['Col1','Col2'])
print ("\n\nInitial DataFrame\n")
print (df)
pd.set_option("display.max_rows",999)
pd.set_option("display.max_columns",999)
df['Col1_val_exists_in_Col2'] = np.where(df.Col1.isin(df.Col2) & ~df.Col1.isnull(), True, False)
df['Col2_val_exists_in_Col1'] = np.where(df.Col2.isin(df.Col1) & ~df.Col2.isnull(), True, False)
df['Dupe'] = df.Col1_val_exists_in_Col2 | df.Col2_val_exists_in_Col1
print ("\n\nFinal DataFrame\n")
print (df)
Initial DataFrame

   Col1  Col2
0  AAPL  GOOG
1   NaN   IBM
2  GOOG  MSFT
3   MMM   NaN
4   NaN  GOOG
5  INTC  AAPL
6    FB    VZ

Final DataFrame

   Col1  Col2  Col1_val_exists_in_Col2  Col2_val_exists_in_Col1   Dupe
0  AAPL  GOOG                     True                     True   True
1   NaN   IBM                    False                    False  False
2  GOOG  MSFT                     True                    False   True
3   MMM   NaN                    False                    False  False
4   NaN  GOOG                    False                     True   True
5  INTC  AAPL                    False                     True   True
6    FB    VZ                    False                    False  False
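The question also asks about values that appear more than once overall. The scripts above only compare Col1 against Col2, so a value repeated inside a single column would not be caught by them. One possible way to cover that case, sketched here under the assumption that NaN should be ignored, is to pool both columns and count occurrences. The names `all_values`, `counts`, `repeated`, `Col1_repeated` and `Col2_repeated` are illustrative only.

```python
import numpy as np
import pandas as pd

data = {'Col1': ['AAPL', np.nan, 'GOOG', 'MMM', np.nan, 'INTC', 'FB'],
        'Col2': ['GOOG', 'IBM', 'MSFT', np.nan, 'GOOG', 'AAPL', 'VZ']}
df = pd.DataFrame(data, columns=['Col1', 'Col2'])

# Pool the values of both columns into one Series and drop missing entries.
all_values = pd.concat([df['Col1'], df['Col2']], ignore_index=True).dropna()

# Count how often each value occurs anywhere in the frame.
counts = all_values.value_counts()
repeated = counts[counts > 1]
print(repeated)

# Flag the rows whose value occurs more than once anywhere in the frame.
df['Col1_repeated'] = df['Col1'].isin(repeated.index)
df['Col2_repeated'] = df['Col2'].isin(repeated.index)
print(df)
```

With the sample data the flags happen to agree with the cross-column check, because every repeated value occurs in both columns; on other data the two approaches can differ.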