How to compare 2 CSV files

I have 2 CSV files:

CSV 1 - original_names.csv

Serial,Names
1,James
2,Stephen
3,Ben
4,Harry
5,Jack
6, Peter

CSV 2 - dup_names.csv

Serial,Names
1,James
2,Kate
3,Ben
4,Sara


Desired output - new.csv

Serial,Names,flag
1,0,T
2,Kate,F
3,0,T
4,Sara,F
5,Jack,F
6,Peter,F

As you can see, a name that is the same in both CSVs should be written to new.csv as 0 with flag T; every other row keeps its name (taken from dup_names.csv where one exists) with flag F.

This is what I've tried:

import pandas as pd

df1 = pd.read_csv('original_names.csv')
df2 = pd.read_csv('dup_names.csv')

out = df1.merge(df2['Names'], how='inner', on='Names')

# some code

out.to_csv("new.csv", index=False)


Thank you for your time :)

Do an outer join on the serial column, then add some logic. If the two name columns match, set the flag to 'T'; otherwise set it to 'F'. Then replace 'names' with 0 where the flag is 'T', and with the name from the second CSV otherwise. If there is no name in the second CSV, fill in the name from the first CSV.

import pandas as pd
import numpy as np

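# Sample data standing in for original_names.csv and dup_names.csv
df1 = pd.DataFrame({'serial': [1, 2, 3, 4, 5, 6],
                    'names': ['James', 'Stephen', 'Ben', 'Harry', 'Jack', 'Peter']})

df2 = pd.DataFrame({'serial': [1, 2, 3, 4],
                    'names': ['James', 'Kate', 'Ben', 'Sara']})

# Outer join on serial; the overlapping 'names' columns become names_x and names_y
out = df1.merge(df2, how='outer', on='serial')

# 'T' where both files have the same name, otherwise 'F'
out['flag'] = np.where(out.names_x == out.names_y, 'T', 'F')
# Matching names become 0; otherwise take the name from the second file
out['names'] = np.where(out.flag == 'T', 0, out.names_y)
# Serials missing from the second file keep the name from the first file
out['names'] = out['names'].fillna(out.names_x)

out = out[['serial', 'names', 'flag']]
out.to_csv("new.csv", index=False)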

Output:

print(out)
   serial  names flag
0       1      0    T
1       2   Kate    F
2       3      0    T
3       4   Sara    F
4       5   Jack    F
5       6  Peter    F
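
The same steps can be wrapped in a small function so they are easy to reuse and check. This is only a sketch: the helper name `flag_matching_names` and its `key` and `col` parameters are made up for illustration and are not part of pandas or of the answer above.

```python
import numpy as np
import pandas as pd


def flag_matching_names(original, dup, key="serial", col="names"):
    """Hypothetical helper: outer-join on `key`, then flag rows whose names match."""
    merged = original.merge(dup, how="outer", on=key, suffixes=("_x", "_y"))
    # True only where both frames have the same name (NaN never compares equal)
    same = merged[f"{col}_x"] == merged[f"{col}_y"]
    # Prefer the name from the second frame; fall back to the first when it is missing
    names = merged[f"{col}_y"].fillna(merged[f"{col}_x"])
    # Matching names are replaced by 0, as in the desired output
    merged[col] = np.where(same, 0, names)
    merged["flag"] = np.where(same, "T", "F")
    return merged[[key, col, "flag"]]


df1 = pd.DataFrame({"serial": [1, 2, 3, 4, 5, 6],
                    "names": ["James", "Stephen", "Ben", "Harry", "Jack", "Peter"]})
df2 = pd.DataFrame({"serial": [1, 2, 3, 4],
                    "names": ["James", "Kate", "Ben", "Sara"]})

result = flag_matching_names(df1, df2)
print(result)
```

Computing the boolean mask once and deriving both columns from it avoids relying on the order in which 'flag' and 'names' are filled in.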

You could use:

import pandas as pd
import numpy as np

df1 = pd.read_csv('original_names.csv')
df2 = pd.read_csv('dup_names.csv')

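# Left join keeps every row of the original file; 'Names' becomes Names_x / Names_y
out = df1.merge(df2, how='left', on='Serial')

# Matching names become 0; otherwise take the name from dup_names.csv
out['Names'] = np.where(out['Names_x'] == out['Names_y'],
                        0, out['Names_y'])
# Serials missing from dup_names.csv keep the original name
out['Names'] = out['Names'].fillna(out['Names_x'])
out['flag'] = np.where(out['Names'] == 0, 'T', 'F')
out = out.drop(['Names_x', 'Names_y'], axis=1)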

out.to_csv('new.csv', index=False)

Output:

   Serial  Names flag
0       1      0    T
1       2   Kate    F
2       3      0    T
3       4   Sara    F
4       5   Jack    F
5       6  Peter    F
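
One detail worth checking before comparing names: the sample original_names.csv in the question has a space after the comma in `6, Peter`. If the real file looks like that, `read_csv` keeps the leading space by default, so the name would not compare equal to `Peter`. A minimal sketch, using an inline copy of the question's data through `io.StringIO` instead of the file on disk, shows the effect of the `skipinitialspace` option:

```python
import io

import pandas as pd

# Inline copy of original_names.csv from the question, including the space in "6, Peter"
original_csv = """Serial,Names
1,James
2,Stephen
3,Ben
4,Harry
5,Jack
6, Peter
"""

# Default parsing keeps the space that follows the delimiter
raw = pd.read_csv(io.StringIO(original_csv))
# skipinitialspace=True drops spaces that directly follow the delimiter
clean = pd.read_csv(io.StringIO(original_csv), skipinitialspace=True)

print(repr(raw.loc[5, "Names"]))    # ' Peter'
print(repr(clean.loc[5, "Names"]))  # 'Peter'
```

With the real files, the same option can be passed to `pd.read_csv('original_names.csv', skipinitialspace=True)`.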
