How to compare two dataframes using column index?
I am exporting HDFS query output to a CSV file using the INSERT OVERWRITE LOCAL DIRECTORY command. This exports the data without a header. I have another dataframe, read from Oracle with column headers, which I need to compare against the HDFS output.
df1 = pd.read_csv('/home/User/hdfs_result.csv', header = None)
print(df1)
      0  1                    2
0  XPRN  A  2019-12-16 00:00:00
1  XPRW  I  2019-12-16 00:00:00
2  XPS2  I  2003-09-30 00:00:00
df = pd.read_sql(sqlquery, sqlconn)
   UNIT STATUS                 Date
0  XPRN      A  2019-12-16 00:00:00
1  XPRW      A  2019-12-16 00:00:00
2  XPS2      I  2003-09-30 00:00:00
Since df1 has no header, I can't use merge or join to compare the data, though I can do df-df1.
Please suggest how I can compare the two and print the differences.
You can pass the underlying numpy array for comparison (df2 below is the Oracle dataframe, called df in the question):
df2.where(df2==df1.values)
Output (differences are masked as NaN):
   UNIT STATUS                 Date
0  XPRN      A  2019-12-16 00:00:00
1  XPRW    NaN  2019-12-16 00:00:00
2  XPS2      I  2003-09-30 00:00:00
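To try the masking step without the CSV file or the Oracle connection, here is a minimal sketch that rebuilds both frames by hand from the sample rows in the question. It assumes every column holds plain strings in both frames. If read_sql returns the Date column as datetimes while the CSV holds strings, those cells would compare as unequal, so the dtypes would need aligning first (for example with parse_dates in read_csv).

```python
import pandas as pd

# Headerless frame, standing in for the HDFS export (columns are 0, 1, 2).
df1 = pd.DataFrame([
    ["XPRN", "A", "2019-12-16 00:00:00"],
    ["XPRW", "I", "2019-12-16 00:00:00"],
    ["XPS2", "I", "2003-09-30 00:00:00"],
])

# Frame with headers, standing in for the Oracle query result.
df2 = pd.DataFrame(
    [
        ["XPRN", "A", "2019-12-16 00:00:00"],
        ["XPRW", "A", "2019-12-16 00:00:00"],
        ["XPS2", "I", "2003-09-30 00:00:00"],
    ],
    columns=["UNIT", "STATUS", "Date"],
)

# df1.values is a bare numpy array, so the comparison is purely positional:
# df1's column labels (0, 1, 2) play no part in it.
masked = df2.where(df2 == df1.values)
print(masked)
```

This only works when both frames have the same shape and the rows and columns are in the same order.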
For non-matching rows:
df2[(df2!=df1.values).any(axis=1)]
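If you would rather use merge after all, one possible alternative (not part of the answer above) is to copy the Oracle column names onto the headerless frame first. The sketch below assumes both frames list their columns in the same order and hold the same dtypes; df1_named, mismatch_rows, merged and diff are names made up for illustration. Note that the merge matches rows by value rather than by position, so it behaves differently from the positional check when rows are reordered or duplicated.

```python
import pandas as pd

# Sample data from the question, rebuilt by hand (all values as strings).
df1 = pd.DataFrame([
    ["XPRN", "A", "2019-12-16 00:00:00"],
    ["XPRW", "I", "2019-12-16 00:00:00"],
    ["XPS2", "I", "2003-09-30 00:00:00"],
])
df2 = pd.DataFrame(
    [
        ["XPRN", "A", "2019-12-16 00:00:00"],
        ["XPRW", "A", "2019-12-16 00:00:00"],
        ["XPS2", "I", "2003-09-30 00:00:00"],
    ],
    columns=["UNIT", "STATUS", "Date"],
)

# Positional check: rows of df2 where at least one cell differs from df1.
mismatch_rows = df2[(df2 != df1.values).any(axis=1)]
print(mismatch_rows)

# Label-based alternative: name df1's columns after df2's, then merge.
df1_named = df1.copy()
df1_named.columns = df2.columns  # assumption: identical column order

# An outer merge on all shared columns; the _merge column records whether
# a row was found in df2 only, df1 only, or both.
merged = df2.merge(df1_named, how="outer", indicator=True)
diff = merged[merged["_merge"] != "both"]
print(diff)
```

Here diff contains both versions of the differing row: the one marked left_only comes from df2 (Oracle) and the one marked right_only from df1 (HDFS).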