How to compare two dataframes using column index?

I am exporting HDFS query output to a CSV file using the INSERT OVERWRITE LOCAL DIRECTORY command. This exports the data without a header. I also have another dataframe, built from Oracle query output, which does have a header, and I need to compare it against the HDFS output.

df1 = pd.read_csv('/home/User/hdfs_result.csv', header = None)
print(df1)

      0  1                    2
0  XPRN  A  2019-12-16 00:00:00
1  XPRW  I  2019-12-16 00:00:00
2  XPS2  I  2003-09-30 00:00:00


df = pd.read_sql(sqlquery, sqlconn)


   UNIT STATUS                 Date
0  XPRN      A  2019-12-16 00:00:00
1  XPRW      A  2019-12-16 00:00:00
2  XPS2      I  2003-09-30 00:00:00

Since df1 has no header, I can't use merge or join to compare the data, though I can do df - df1.

How can I compare the two dataframes and print the differences?

You can pass the underlying NumPy array for comparison (here df2 is the Oracle dataframe, called df in the question):

df2.where(df2==df1.values)

Output (differences are masked as NaN):

   UNIT STATUS                 Date
0  XPRN      A  2019-12-16 00:00:00
1  XPRW    NaN  2019-12-16 00:00:00
2  XPS2      I  2003-09-30 00:00:00
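
To try this without the original data sources, the two frames can be rebuilt by hand. This is a minimal sketch that assumes the values shown in the question. The frames are constructed inline instead of being read from the CSV file or from Oracle, and every column is kept as text so the comparison is purely textual:

```python
import pandas as pd

# Headerless frame, standing in for the HDFS CSV export (columns are 0, 1, 2).
df1 = pd.DataFrame([
    ["XPRN", "A", "2019-12-16 00:00:00"],
    ["XPRW", "I", "2019-12-16 00:00:00"],
    ["XPS2", "I", "2003-09-30 00:00:00"],
])

# Frame with a header, standing in for the Oracle query result.
df2 = pd.DataFrame({
    "UNIT": ["XPRN", "XPRW", "XPS2"],
    "STATUS": ["A", "A", "I"],
    "Date": ["2019-12-16 00:00:00", "2019-12-16 00:00:00", "2003-09-30 00:00:00"],
})

# df1.values drops df1's labels, so cells are matched by position only.
# Both frames must have the same shape for this comparison to work.
matches = df2 == df1.values

# Keep the cells that agree; cells that differ become NaN.
masked = df2.where(matches)
print(masked)
```

If the Oracle query returns the Date column as datetime values while the CSV holds text, the cells may not compare equal. In that case, parsing the CSV column before comparing (for example with parse_dates=[2] in read_csv) is one option.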

For non-matching rows:

df2[(df2!=df1.values).any(axis=1)]
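
To see both the Oracle value and the HDFS value for each differing cell, one possible extension is to give df1 the same column names as df2 and then use DataFrame.compare. This is only a sketch, not part of the original answer: it assumes pandas 1.1 or later (where compare was added), equal shapes and row order in both frames, and the hypothetical name df1_named for the relabelled copy. The data is rebuilt inline from the values in the question:

```python
import pandas as pd

# Inline stand-ins for the two frames from the question (all values as text).
df1 = pd.DataFrame([
    ["XPRN", "A", "2019-12-16 00:00:00"],
    ["XPRW", "I", "2019-12-16 00:00:00"],
    ["XPS2", "I", "2003-09-30 00:00:00"],
])
df2 = pd.DataFrame({
    "UNIT": ["XPRN", "XPRW", "XPS2"],
    "STATUS": ["A", "A", "I"],
    "Date": ["2019-12-16 00:00:00", "2019-12-16 00:00:00", "2003-09-30 00:00:00"],
})

# Rows of df2 where at least one cell differs from df1 (positional comparison).
mismatch = (df2 != df1.values).any(axis=1)
print(df2[mismatch])

# Borrow df2's header so both frames are identically labelled
# (df1_named is a hypothetical name used only in this sketch).
df1_named = df1.set_axis(df2.columns, axis=1)

# compare() keeps only the differing cells: "self" is df2, "other" is df1.
diff = df2.compare(df1_named)
print(diff)
```

Once df1 carries the same column names, merge and join also become usable, which addresses the concern raised in the question.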
