Compare 2 csv in pandas upon columns

I am using this pandas code to get the difference between 2 CSV files. I want to find all rows in 1.csv whose userdid also appears in 3.csv but whose did is different.

How can I do this?

>>> import pandas as pd
>>> df1 = pd.read_csv('1.csv')
>>> df2 = pd.read_csv('3.csv')
>>> df1.apply(tuple, 1).isin(df2.apply(tuple, 1))

Expected:

R2N86Q6KXW9K1T0Q7PT,U3Y5VV6BP3SQQ5B8VHV,Active,"2021-01-01 00:27:03"

1.csv

did,userdid,ResumeStatus,sysinserteddt
R3X2PB661CST2HJQWVM,U3Y5VV6BP3SQQ5B8VHV,Active,"2021-01-01 00:27:03"
R2N86Q6KXW9K1T0Q7PT,U3Y5VV6BP3SQQ5B8VHV,Active,"2021-01-01 00:27:03"

3.csv

did,userdid,email,status,modified,sysinserteddt
R3X2PB661CST2HJQWVM,U3Y5VV6BP3SQQ5B8VHV,r@gmail.com,536870912,"2022-05-02 21:50:15.813","2022-05-02 21:50:15.907"
RD71J16YLWTXRDRV1YG,U3Y5VV6BP3SQQ5B8VHV,j.com,536870912,"2021-06-01 16:02:54.853","2021-06-01 16:02:52.15"

You can first inner-join df2 onto df1 on userdid, then keep only the rows where the did from df1 is not equal to the did from df2 (renamed to did_temp):

################################# Data Frames #################################
import pandas as pd

ls1 = ['did,userdid,ResumeStatus,sysinserteddt', 'R3X2PB661CST2HJQWVM,U3Y5VV6BP3SQQ5B8VHV,Active,"2021-01-01 00:27:03"', 'R2N86Q6KXW9K1T0Q7PT,U3Y5VV6BP3SQQ5B8VHV,Active,"2021-01-01 00:27:03"']
ls1 = [i.split(',') for i in ls1]
df1 = pd.DataFrame(ls1[1:], columns=ls1[0])

ls2 = ['did,userdid,email,status,modified,sysinserteddt', 'R3X2PB661CST2HJQWVM,U3Y5VV6BP3SQQ5B8VHV,r@gmail.com,536870912,2022-05-02 21:50:15.813,2022-05-02 21:50:15.907', 'RD71J16YLWTXRDRV1YG,U00RT61PD6SHSH2PTL,j.com,536870912,2021-06-01 16:02:54.853,2021-06-01 16:02:52.15']
ls2 = [i.split(',') for i in ls2]

df2 = pd.DataFrame(ls2[1:], columns=ls2[0])
###############################################################################

df_out = pd.merge(df1, df2[['did', 'userdid']].rename(columns={'did': 'did_temp'}), on=['userdid'], how="inner")
df_out[df_out['did'].ne(df_out['did_temp'])].drop(columns=['did_temp'])

Output:

    did                 userdid             ResumeStatus   sysinserteddt
1   R2N86Q6KXW9K1T0Q7PT U3Y5VV6BP3SQQ5B8VHV Active         "2021-01-01 00:27:03"
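
Note that building the frames with split(',') keeps the literal double quotes around the timestamp, which is why they show up in the output above. A minimal sketch of the same join, assuming the sample rows from this answer are parsed with pd.read_csv instead (the csv1 and csv2 strings and the pairs, merged and result names are only illustrative):

```python
import io
import pandas as pd

# Sample data copied from the answer above (note the second userdid in csv2).
csv1 = '''did,userdid,ResumeStatus,sysinserteddt
R3X2PB661CST2HJQWVM,U3Y5VV6BP3SQQ5B8VHV,Active,"2021-01-01 00:27:03"
R2N86Q6KXW9K1T0Q7PT,U3Y5VV6BP3SQQ5B8VHV,Active,"2021-01-01 00:27:03"
'''
csv2 = '''did,userdid,email,status,modified,sysinserteddt
R3X2PB661CST2HJQWVM,U3Y5VV6BP3SQQ5B8VHV,r@gmail.com,536870912,2022-05-02 21:50:15.813,2022-05-02 21:50:15.907
RD71J16YLWTXRDRV1YG,U00RT61PD6SHSH2PTL,j.com,536870912,2021-06-01 16:02:54.853,2021-06-01 16:02:52.15
'''

# read_csv handles the quoted fields, so the quotes are not part of the value.
df1 = pd.read_csv(io.StringIO(csv1))
df2 = pd.read_csv(io.StringIO(csv2))

# Same idea as above: join on userdid, then keep rows whose did differs.
pairs = df2[['did', 'userdid']].rename(columns={'did': 'did_temp'})
merged = df1.merge(pairs, on='userdid', how='inner')
result = merged[merged['did'].ne(merged['did_temp'])].drop(columns='did_temp')
print(result)
```

With real files you would pass the file names to pd.read_csv instead of the StringIO objects.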

If you want the output as a string, you can do the following:

df_out = pd.merge(df1, df2[['did', 'userdid']].rename(columns={'did': 'did_temp'}), on=['userdid'], how="inner")
','.join(df_out[df_out['did'].ne(df_out['did_temp'])].drop(columns=['did_temp']).values[0])

Output:

'R2N86Q6KXW9K1T0Q7PT,U3Y5VV6BP3SQQ5B8VHV,Active,"2021-01-01 00:27:03"'
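
The .values[0] above only returns the first matching row and raises an IndexError when nothing matches. If several rows can match, a possible variation is to build one string per row. This is only a sketch; the result frame below is a hand-built stand-in for the filtered frame, and lines is an illustrative name:

```python
import pandas as pd

# Stand-in for the filtered frame produced by the merge above.
result = pd.DataFrame({
    'did': ['R2N86Q6KXW9K1T0Q7PT'],
    'userdid': ['U3Y5VV6BP3SQQ5B8VHV'],
    'ResumeStatus': ['Active'],
    'sysinserteddt': ['"2021-01-01 00:27:03"'],
})

# One comma-joined string per row; an empty frame simply gives an empty list.
lines = [','.join(map(str, row))
         for row in result.itertuples(index=False, name=None)]
print(lines)
```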

Given:

df1

                   did              userdid ResumeStatus        sysinserteddt
0  R3X2PB661CST2HJQWVM  U3Y5VV6BP3SQQ5B8VHV       Active  2021-01-01 00:27:03
1  R2N86Q6KXW9K1T0Q7PT  U3Y5VV6BP3SQQ5B8VHV       Active  2021-01-01 00:27:03

df3

                   did              userdid        email     status                 modified            sysinserteddt
0  R3X2PB661CST2HJQWVM  U3Y5VV6BP3SQQ5B8VHV  r@gmail.com  536870912  2022-05-02 21:50:15.813  2022-05-02 21:50:15.907
1  RD71J16YLWTXRDRV1YG  U3Y5VV6BP3SQQ5B8VHV        j.com  536870912  2021-06-01 16:02:54.853   2021-06-01 16:02:52.15

Doing:

# All entries from df1 where userdid is in df3, but did is not
df = df1[df1[['did', 'userdid']].isin(df3).eq([False, True]).all(axis=1)]
print(df)

Output:

                   did              userdid ResumeStatus        sysinserteddt
1  R2N86Q6KXW9K1T0Q7PT  U3Y5VV6BP3SQQ5B8VHV       Active  2021-01-01 00:27:03
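
One thing to be aware of: when DataFrame.isin is given another DataFrame, it compares cells that share the same index and column labels, so the line above depends on the row order of df3. If the two files are not aligned row by row, a Series-based mask may be closer to what is wanted. A rough sketch, assuming "did is different" means the did does not appear anywhere in df3 (the helper name same_user_new_did is made up for this example):

```python
import pandas as pd

df1 = pd.DataFrame({
    'did': ['R3X2PB661CST2HJQWVM', 'R2N86Q6KXW9K1T0Q7PT'],
    'userdid': ['U3Y5VV6BP3SQQ5B8VHV', 'U3Y5VV6BP3SQQ5B8VHV'],
    'ResumeStatus': ['Active', 'Active'],
    'sysinserteddt': ['2021-01-01 00:27:03', '2021-01-01 00:27:03'],
})
df3 = pd.DataFrame({
    'did': ['R3X2PB661CST2HJQWVM', 'RD71J16YLWTXRDRV1YG'],
    'userdid': ['U3Y5VV6BP3SQQ5B8VHV', 'U3Y5VV6BP3SQQ5B8VHV'],
})

def same_user_new_did(left, right):
    """Rows of left whose userdid occurs in right but whose did does not."""
    mask = left['userdid'].isin(right['userdid']) & ~left['did'].isin(right['did'])
    return left[mask]

result = same_user_new_did(df1, df3)

# Reverse the row order of df3: the Series-based mask is unaffected ...
df3_reordered = df3.iloc[::-1].reset_index(drop=True)
result_reordered = same_user_new_did(df1, df3_reordered)

# ... while the label-aligned DataFrame.isin version now compares other rows.
positional = df1[df1[['did', 'userdid']].isin(df3_reordered)
                 .eq([False, True]).all(axis=1)]

print(result)
print(len(positional))
```

The Series version checks did against every did in df3, not only those of the same user. If a did could legitimately belong to a different user in the other file, a merge on both columns would be the safer choice.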

To CSV:

df.to_csv('file.csv', header=False, index=False)

# We can see what that looks like:
print(df.to_csv(header=False, index=False))

Output:

R2N86Q6KXW9K1T0Q7PT,U3Y5VV6BP3SQQ5B8VHV,Active,2021-01-01 00:27:03

For more information, read the docs.

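This can be done simply using a left merge on did:

>>> import pandas as pd
>>> A = pd.read_csv("1.csv")
>>> B = pd.read_csv("3.csv")
>>> df = pd.merge(A, B, on='did', how='left')
>>> # Rows of A with no matching did in B have NaN in B's columns
>>> df1 = df[df['email'].isna()].drop('email', axis=1)
>>> print(df1)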
                   did            userdid_x ResumeStatus      sysinserteddt_x userdid_y  status modified sysinserteddt_y
1  R2N86Q6KXW9K1T0Q7PT  U3Y5VV6BP3SQQ5B8VHV       Active  2021-01-01 00:27:03       NaN     NaN      NaN             NaN
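
Testing email for NaN works for this sample, but it would also pick up rows that did match if 3.csv had a missing email, and it leaves the _x/_y columns behind. A small sketch of the same left merge using the indicator option of merge instead; the frames are rebuilt inline here so the snippet runs on its own, and only_in_A is an illustrative name:

```python
import pandas as pd

A = pd.DataFrame({
    'did': ['R3X2PB661CST2HJQWVM', 'R2N86Q6KXW9K1T0Q7PT'],
    'userdid': ['U3Y5VV6BP3SQQ5B8VHV', 'U3Y5VV6BP3SQQ5B8VHV'],
    'ResumeStatus': ['Active', 'Active'],
    'sysinserteddt': ['2021-01-01 00:27:03', '2021-01-01 00:27:03'],
})
B = pd.DataFrame({
    'did': ['R3X2PB661CST2HJQWVM', 'RD71J16YLWTXRDRV1YG'],
    'userdid': ['U3Y5VV6BP3SQQ5B8VHV', 'U3Y5VV6BP3SQQ5B8VHV'],
    'email': ['r@gmail.com', 'j.com'],
})

# Merge only against B's key column so no _x/_y suffixes are created;
# indicator=True adds a '_merge' column saying where each row was found.
merged = A.merge(B[['did']].drop_duplicates(), on='did',
                 how='left', indicator=True)

# 'left_only' marks rows of A whose did has no match in B.
only_in_A = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')
print(only_in_A)
```

Like the answer above, this only checks did; it does not look at userdid.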
