I have two tables, each about 1 TB in size, that are supposed to contain the same data.
However, when I use them in the same way, the results differ.
So I would like to compare them record by record, in order to find out where the difference is.
My current solution is an ugly one:
I sort both tables by the same key, dump them to local disk, and compare the files with diff.
Can someone suggest a more elegant method to achieve this?
Is it two copies of the same data? If so, can you join the tables and select out the differences?
Quick example:
create table atable (
    id int,
    field1 int,
    field2 varchar(16)
);

create table btable (
    id int,
    field1 int,
    field2 varchar(16)
);

select * from atable as a
join btable as b on a.id = b.id
where a.field1 != b.field1
   or a.field2 != b.field2;
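A minimal runnable sketch of this join-based comparison, using Python's sqlite3 with an in-memory database; the table and column names (atable, btable, id, field1, field2) are taken from the example above, and the sample rows are made up for illustration:

```python
import sqlite3

# Build two small tables that agree on id 1 but differ on id 2.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE atable (id INT, field1 INT, field2 VARCHAR(16));
    CREATE TABLE btable (id INT, field1 INT, field2 VARCHAR(16));
    INSERT INTO atable VALUES (1, 10, 'same'), (2, 20, 'changed');
    INSERT INTO btable VALUES (1, 10, 'same'), (2, 99, 'changed');
""")

# Rows whose key matches but whose other fields differ.
# Note: != treats NULL as unknown, so NULL-able columns need extra handling,
# and an inner join misses rows whose key exists in only one table.
diffs = conn.execute("""
    SELECT a.id, a.field1, b.field1, a.field2, b.field2
    FROM atable AS a
    JOIN btable AS b ON a.id = b.id
    WHERE a.field1 != b.field1 OR a.field2 != b.field2
""").fetchall()
print(diffs)  # the row with id 2 differs on field1
```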
You could try hashing the rows of table 1 with any hash function, then run through table 2 and check whether each row's hash is already present; any row whose hash is missing is a difference. Theoretically this should be the most efficient solution, I think.
You could use MinHash / LSH hash functions for scaling up.
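The hashing idea can be sketched as follows; this is an illustration only, with rows represented as in-memory tuples (in practice they would be streamed from the tables), and the serialization via repr is an assumption — any stable per-row encoding works:

```python
import hashlib

def row_digest(row):
    # Stable serialization of the row, then a compact digest.
    return hashlib.md5(repr(row).encode("utf-8")).hexdigest()

# Toy data standing in for the two tables.
table1 = [(1, 10, "same"), (2, 20, "changed")]
table2 = [(1, 10, "same"), (2, 99, "changed")]

# Pass 1: digest every row of table 1.
seen = {row_digest(r) for r in table1}

# Pass 2: any row of table 2 whose digest was never seen is a difference.
only_in_table2 = [r for r in table2 if row_digest(r) not in seen]
print(only_in_table2)  # [(2, 99, 'changed')]
```

Note this detects rows present in table 2 but not in table 1; running the two passes in the other direction as well catches rows missing from table 2.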
As mentioned by Jay, using a hash is a more effective and reliable solution than joining the tables (joins run into difficulties when keys are repeated, for instance).
You could have a look at this Python program, which handles such comparisons of Hive tables and shows the differences in a web page: https://github.com/bolcom/hive_compared_bq