Compare 2 CSV files in pandas by column
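I am using this pandas code to get the difference between 2 CSV files. I want to find all rows in 1.csv whose userdid is the same as in 3.csv but whose did is different.
How can I do this?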
>>> import pandas as pd
>>> df1 = pd.read_csv('1.csv')
>>> df2 = pd.read_csv('3.csv')
>>> df1.apply(tuple, 1).isin(df2.apply(tuple, 1))
Expected:
R2N86Q6KXW9K1T0Q7PT,U3Y5VV6BP3SQQ5B8VHV,Active,"2021-01-01 00:27:03"
1.csv
did,userdid,ResumeStatus,sysinserteddt
R3X2PB661CST2HJQWVM,U3Y5VV6BP3SQQ5B8VHV,Active,"2021-01-01 00:27:03"
R2N86Q6KXW9K1T0Q7PT,U3Y5VV6BP3SQQ5B8VHV,Active,"2021-01-01 00:27:03"
3.csv
did,userdid,email,status,modified,sysinserteddt
R3X2PB661CST2HJQWVM,U3Y5VV6BP3SQQ5B8VHV,r@gmail.com,536870912,"2022-05-02 21:50:15.813","2022-05-02 21:50:15.907"
RD71J16YLWTXRDRV1YG,U3Y5VV6BP3SQQ5B8VHV,j.com,536870912,"2021-06-01 16:02:54.853","2021-06-01 16:02:52.15"
You can first do an inner join of df2 and df1 on userdid, then keep all rows where the did from df1 is not equal to the did from df2 (renamed did_temp):
################################# Data Frames #################################
import pandas as pd
ls1 = ['did,userdid,ResumeStatus,sysinserteddt', 'R3X2PB661CST2HJQWVM,U3Y5VV6BP3SQQ5B8VHV,Active,"2021-01-01 00:27:03"', 'R2N86Q6KXW9K1T0Q7PT,U3Y5VV6BP3SQQ5B8VHV,Active,"2021-01-01 00:27:03"']
ls1 = [i.split(',') for i in ls1]
df1 = pd.DataFrame(ls1[1:], columns=ls1[0])
ls2 = ['did,userdid,email,status,modified,sysinserteddt', 'R3X2PB661CST2HJQWVM,U3Y5VV6BP3SQQ5B8VHV,r@gmail.com,536870912,2022-05-02 21:50:15.813,2022-05-02 21:50:15.907', 'RD71J16YLWTXRDRV1YG,U00RT61PD6SHSH2PTL,j.com,536870912,2021-06-01 16:02:54.853,2021-06-01 16:02:52.15']
ls2 = [i.split(',') for i in ls2]
df2 = pd.DataFrame(ls2[1:], columns=ls2[0])
###############################################################################
df_out = pd.merge(df1, df2[['did', 'userdid']].rename(columns={'did': 'did_temp'}), on=['userdid'], how="inner")
df_out[df_out['did'].ne(df_out['did_temp'])].drop(columns=['did_temp'])
Output:
did userdid ResumeStatus sysinserteddt
1 R2N86Q6KXW9K1T0Q7PT U3Y5VV6BP3SQQ5B8VHV Active "2021-01-01 00:27:03"
If you want the output as a string, you can do the following:
df_out = pd.merge(df1, df2[['did', 'userdid']].rename(columns={'did': 'did_temp'}), on=['userdid'], how="inner")
','.join(df_out[df_out['did'].ne(df_out['did_temp'])].drop(columns=['did_temp']).values[0])
Output:
'R2N86Q6KXW9K1T0Q7PT,U3Y5VV6BP3SQQ5B8VHV,Active,"2021-01-01 00:27:03"'
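A possible variation, offered as a sketch rather than as the answerer's method: if several rows of 3.csv share a userdid (as in the file posted in the question), joining on userdid alone pairs each df1 row with every df2 row for that user, so a row can pass the `ne` filter even though its did does exist in df2. The version below instead left-merges on both keys with `indicator=True` and keeps the rows that found no partner. The inline CSV text and the names `csv1`, `csv3`, `pairs`, `merged`, and `result` are assumptions made for illustration.

```python
import io

import pandas as pd

# Inline copies of the two files from the question (assumed contents).
csv1 = '''did,userdid,ResumeStatus,sysinserteddt
R3X2PB661CST2HJQWVM,U3Y5VV6BP3SQQ5B8VHV,Active,"2021-01-01 00:27:03"
R2N86Q6KXW9K1T0Q7PT,U3Y5VV6BP3SQQ5B8VHV,Active,"2021-01-01 00:27:03"
'''
csv3 = '''did,userdid,email,status,modified,sysinserteddt
R3X2PB661CST2HJQWVM,U3Y5VV6BP3SQQ5B8VHV,r@gmail.com,536870912,"2022-05-02 21:50:15.813","2022-05-02 21:50:15.907"
RD71J16YLWTXRDRV1YG,U3Y5VV6BP3SQQ5B8VHV,j.com,536870912,"2021-06-01 16:02:54.853","2021-06-01 16:02:52.15"
'''
df1 = pd.read_csv(io.StringIO(csv1))
df2 = pd.read_csv(io.StringIO(csv3))

# Unique (userdid, did) pairs present in df2, so the merge cannot duplicate df1 rows.
pairs = df2[['userdid', 'did']].drop_duplicates()

# Left merge on both keys; '_merge' records whether each df1 row found a match.
merged = df1.merge(pairs, on=['userdid', 'did'], how='left', indicator=True)

# Keep rows whose userdid is known to df2 but whose (userdid, did) pair is not.
mask = (merged['_merge'] == 'left_only') & merged['userdid'].isin(df2['userdid'])
result = merged.loc[mask, df1.columns]
print(result)
```

Restricting the right-hand side to the two key columns keeps the original column names of df1 intact, so no `_x`/`_y` suffixes need cleaning up afterwards.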
Given:
df1
did userdid ResumeStatus sysinserteddt
0 R3X2PB661CST2HJQWVM U3Y5VV6BP3SQQ5B8VHV Active 2021-01-01 00:27:03
1 R2N86Q6KXW9K1T0Q7PT U3Y5VV6BP3SQQ5B8VHV Active 2021-01-01 00:27:03
df3
did userdid email status modified sysinserteddt
0 R3X2PB661CST2HJQWVM U3Y5VV6BP3SQQ5B8VHV r@gmail.com 536870912 2022-05-02 21:50:15.813 2022-05-02 21:50:15.907
1 RD71J16YLWTXRDRV1YG U3Y5VV6BP3SQQ5B8VHV j.com 536870912 2021-06-01 16:02:54.853 2021-06-01 16:02:52.15
Doing:
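# All entries from df1 where userdid is in df3, but did is not.
# Note: DataFrame.isin(DataFrame) matches on index and column labels,
# so each df1 row is compared with the df3 row that has the same index.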
df = df1[df1[['did', 'userdid']].isin(df3).eq([False, True]).all(axis=1)]
print(df)
Output:
did userdid ResumeStatus sysinserteddt
1 R2N86Q6KXW9K1T0Q7PT U3Y5VV6BP3SQQ5B8VHV Active 2021-01-01 00:27:03
To CSV:
df.to_csv('file.csv', header=False, index=False)
# We can see what that looks like:
print(df.to_csv(header=False, index=False))
Output:
R2N86Q6KXW9K1T0Q7PT,U3Y5VV6BP3SQQ5B8VHV,Active,2021-01-01 00:27:03
For more information, read the documentation.
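Because `DataFrame.isin` aligns on index and column labels when it is given another DataFrame, the `isin(df3)` call above depends on the two frames listing their rows in the same order. The sketch below reverses the row order of df3 to show the difference, and compares it with a value-based test built from `Series.isin`. The frames are rebuilt by hand from the data shown above (df3 is reduced to its two key columns); the names `positional` and `by_value` are assumptions for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({
    'did': ['R3X2PB661CST2HJQWVM', 'R2N86Q6KXW9K1T0Q7PT'],
    'userdid': ['U3Y5VV6BP3SQQ5B8VHV', 'U3Y5VV6BP3SQQ5B8VHV'],
    'ResumeStatus': ['Active', 'Active'],
    'sysinserteddt': ['2021-01-01 00:27:03', '2021-01-01 00:27:03'],
})

# Same df3 rows as above, but deliberately listed in reverse order.
df3 = pd.DataFrame({
    'did': ['RD71J16YLWTXRDRV1YG', 'R3X2PB661CST2HJQWVM'],
    'userdid': ['U3Y5VV6BP3SQQ5B8VHV', 'U3Y5VV6BP3SQQ5B8VHV'],
})

# Label-aligned comparison: row 0 of df1 is only checked against row 0 of df3.
positional = df1[df1[['did', 'userdid']].isin(df3).eq([False, True]).all(axis=1)]

# Value-based comparison: each column is checked against all values in df3.
by_value = df1[df1['userdid'].isin(df3['userdid']) & ~df1['did'].isin(df3['did'])]

print(positional['did'].tolist())
print(by_value['did'].tolist())
```

With the reversed df3, the label-aligned version returns both rows of df1, while the value-based version still returns only the row whose did is absent from df3.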
This can be done simply with a left merge:
>>> import pandas as pd
>>> A = pd.read_csv("1.csv")
>>> B = pd.read_csv("3.csv")
>>> df = pd.merge(A, B, on='did', how='left')
>>> df1 = df[df['email'].isna()].drop('email', axis=1)
>>> print(df1)
did userdid_x ResumeStatus sysinserteddt_x userdid_y status modified sysinserteddt_y
1 R2N86Q6KXW9K1T0Q7PT U3Y5VV6BP3SQQ5B8VHV Active 2021-01-01 00:27:03 NaN NaN NaN NaN
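Testing `email` for NaN treats a missing email as a missing match, which would also catch rows of 3.csv whose email happens to be empty. A hedged alternative is to let the merge itself report which rows matched, using `indicator=True`, and to merge only against the key column so that no `_x`/`_y` suffixes appear. Like the snippet above, this checks did only, not userdid. The frames below are rebuilt by hand from the question's data (B is reduced to a few columns), and the names `flagged` and `only_in_A` are assumptions for illustration.

```python
import pandas as pd

A = pd.DataFrame({
    'did': ['R3X2PB661CST2HJQWVM', 'R2N86Q6KXW9K1T0Q7PT'],
    'userdid': ['U3Y5VV6BP3SQQ5B8VHV', 'U3Y5VV6BP3SQQ5B8VHV'],
    'ResumeStatus': ['Active', 'Active'],
    'sysinserteddt': ['2021-01-01 00:27:03', '2021-01-01 00:27:03'],
})
B = pd.DataFrame({
    'did': ['R3X2PB661CST2HJQWVM', 'RD71J16YLWTXRDRV1YG'],
    'userdid': ['U3Y5VV6BP3SQQ5B8VHV', 'U3Y5VV6BP3SQQ5B8VHV'],
    'email': ['r@gmail.com', 'j.com'],
})

# Merge against the key column only; '_merge' says where each row was found.
flagged = A.merge(B[['did']].drop_duplicates(), on='did', how='left', indicator=True)

# Rows of A whose did has no counterpart in B, with A's original columns.
only_in_A = flagged[flagged['_merge'] == 'left_only'].drop(columns='_merge')
print(only_in_A)
```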