简体   繁体   中英

How to compare 2 CSV files

I have 2 CSV files:

CSV 1 - original_names.csv

Serial,Names
1,James
2,Stephen
3,Ben
4,Harry
5,Jack
6, Peter

CSV 2 - dup_names.csv

Serial,Names
1,James
2,Kate
3,Ben
4,Sara


Desired Output - new.csv

Serial,Names,flag
1,0,T
2,Kate,F
3,0,T
4,Sara,F
5,Jack,F
6,Peter,F

As you can see, the same names in both CSV will be updated to 0 if names matches to new.csv.

This is what I've tried:

import pandas as pd

df1 = pd.read_csv('original_names.csv')
df2 = pd.read_csv('dup_names.csv')

out = df1.merge(df2['names'], how='inner', on = 'names')

# some code

out.to_csv("new.csv", index=False)


Thank you for your time:)

Do an outer join, then just add some logic here. If the 2 name columns match, put a 'T' flag in, else put 'F' . Then replace the 'names' should be 0 is 'T' , else the name in the second csv. If there is no name in the second csv, fill those with the name from the first csv.

import pandas as pd
import numpy as np

df1 = pd.DataFrame({'serial':[1,2,3,4,5,6],
                     'names':['James','Stephen','Ben','Harry','Jack','Peter']})

df2 = pd.DataFrame({'serial':[1,2,3,4,],
                     'names':['James','Kate','Ben','Sara']})


out = df1.merge(df2, how='outer', on = ['serial'])

out['flag'] = np.where(out.names_x == out.names_y, 'T', 'F')
out['names'] = np.where(out.flag == 'T', 0, out.names_y)
out['names'] = out['names'].fillna(out.names_x)

out = out[['serial', 'names', 'flag']]
out.to_csv("new.csv", index=False)

Output:

print(out)
   serial  names flag
0       1      0    T
1       2   Kate    F
2       3      0    T
3       4   Sara    F
4       5   Jack    F
5       6  Peter    F

You could use:

import pandas as pd
import numpy as np

df1 = pd.read_csv('original_names.csv')
df2 = pd.read_csv('dup_names.csv')

out = df1.merge(df2, how='left', on = 'Serial')

out['Names'] = np.where(out['Names_x'] == out['Names_y'], 
                        0, out['Names_y'])
out['Names'] = out['Names'].fillna(out['Names_x'])
out['flag'] = np.where(out['Names'] == 0, 'T', 'F')
out = out.drop(['Names_x', 'Names_y'], axis=1)

out.to_csv('new.csv', index=False)

Output:

   serial  names flag
0       1      0    T
1       2   Kate    F
2       3      0    T
3       4   Sara    F
4       5   Jack    F
5       6  Peter    F

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM