
Fastest way to compare if two tables have exactly the same contents

I have two tables, each about 1 TB in size, that are supposed to contain the same data.
However, when I use them in the same way, the results differ.

So I would like to compare them record by record to find out where they differ.


My current solution is an ugly one:
I sorted both tables by the same key, wrote them out to local disk, and compared the files with diff.

Can someone suggest a more elegant method to achieve this?

Are they two copies of the same data? If so, can you join the tables and select the differences?

SQLFiddle

Quick example:

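-- Two tables with identical schemas
create table atable (
  id int,
  field1 int,
  field2 varchar(16)
);

create table btable (
  id int,
  field1 int,
  field2 varchar(16)
);

-- Rows that share an id but differ in at least one field.
-- An inner join does not report ids that exist in only one table.
select * from atable as a
join btable as b on a.id = b.id
where a.field1 != b.field1
or a.field2 != b.field2;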
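To try the join idea on a small sample, here is a minimal sketch using Python's built-in sqlite3 module. The sample rows are made up, and SQLite stands in for whatever engine actually holds the tables, so the syntax may need adjusting (in particular, `is not` is SQLite's NULL-safe inequality). It also adds two left joins to catch ids that exist on one side only, which the inner join alone would not report.

```python
import sqlite3

# Hypothetical in-memory tables mirroring atable/btable; the rows are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table atable (id int, field1 int, field2 varchar(16));
    create table btable (id int, field1 int, field2 varchar(16));
    insert into atable values (1, 10, 'same'), (2, 20, 'old'), (3, 30, 'only in a');
    insert into btable values (1, 10, 'same'), (2, 20, 'new'), (4, 40, 'only in b');
""")

# Ids present in both tables whose fields differ.
# "is not" treats NULL as a comparable value in SQLite.
changed = conn.execute("""
    select a.id from atable as a
    join btable as b on a.id = b.id
    where a.field1 is not b.field1 or a.field2 is not b.field2
    order by a.id
""").fetchall()

# Ids that exist in atable only.
only_in_a = conn.execute("""
    select a.id from atable as a
    left join btable as b on a.id = b.id
    where b.id is null
    order by a.id
""").fetchall()

# Ids that exist in btable only.
only_in_b = conn.execute("""
    select b.id from btable as b
    left join atable as a on a.id = b.id
    where a.id is null
    order by b.id
""").fetchall()

print(changed, only_in_a, only_in_b)
```

The three result sets together cover changed rows, rows missing from btable, and rows missing from atable. This still assumes id is unique within each table.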

You could hash each row of table 1 with any hash function, then run through table 2 and check whether any row has a hash that was not seen in table 1. In theory this should be the most efficient solution, I think.

You could use MinHash/LSH hash functions to scale this up.
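The hashing idea can be sketched in a few lines of Python. This is only an in-memory illustration under stated assumptions: `row_digest` and `unmatched_rows` are hypothetical helper names, rows are plain tuples, and SHA-256 is just one possible hash function. At 1 TB the hashing would have to run inside the query engine or in a distributed job rather than in a single process. A Counter is used instead of a set so that duplicated rows are counted rather than collapsed.

```python
import hashlib
from collections import Counter


def row_digest(row):
    """Hash one row; repr() plus a separator keeps (1, '23') and (12, '3') distinct."""
    payload = "\x1f".join(repr(field) for field in row)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def unmatched_rows(rows_a, rows_b):
    """Return the rows of rows_b that have no remaining counterpart in rows_a."""
    remaining = Counter(row_digest(row) for row in rows_a)
    missing = []
    for row in rows_b:
        digest = row_digest(row)
        if remaining[digest] > 0:
            remaining[digest] -= 1  # consume one matching row from table A
        else:
            missing.append(row)
    return missing


# Made-up sample data: table_a has a duplicated row, table_b has an extra row.
table_a = [(1, 10, "x"), (2, 20, "y"), (2, 20, "y")]
table_b = [(1, 10, "x"), (2, 20, "y"), (3, 30, "z")]

extra_in_b = unmatched_rows(table_a, table_b)
extra_in_a = unmatched_rows(table_b, table_a)
print(extra_in_b, extra_in_a)
```

Calling the function in both directions reports the differences on each side, and no unique key is needed.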

As Jay mentioned, using Hash() is more effective and reliable than joining the tables (a join runs into difficulties when keys are repeated, for instance).

You could have a look at this Python program, which handles such comparisons of Hive tables and shows any differences it finds in a web page: https://github.com/bolcom/hive_compared_bq
