I have two tables, each about 1 TB in size, that are supposed to contain the same data.
However, when I use them in the same way, the results differ.
So I would like to compare them record by record, in order to find out where the difference is.
My current solution is an ugly one:
I sort both tables by the same key, dump them to local disk, and compare the files with diff.
Can someone suggest a more elegant method to achieve this?
Is it two copies of the same data? If so, can you join the tables and select out the differences?
Quick example:
create table atable (
    id int,
    field1 int,
    field2 varchar(16)
);

create table btable (
    id int,
    field1 int,
    field2 varchar(16)
);

select * from atable as a
join btable as b on a.id = b.id
where a.field1 != b.field1
   or a.field2 != b.field2;
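A minimal runnable sketch of this join-based comparison, using Python's sqlite3 with an in-memory database; the table and column names (atable, btable, id, field1, field2) are taken from the example above, and the sample rows are made up for illustration:

```python
import sqlite3

# Build two small tables that agree on id 1 but differ on id 2.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE atable (id INT, field1 INT, field2 VARCHAR(16));
    CREATE TABLE btable (id INT, field1 INT, field2 VARCHAR(16));
    INSERT INTO atable VALUES (1, 10, 'same'), (2, 20, 'changed');
    INSERT INTO btable VALUES (1, 10, 'same'), (2, 99, 'changed');
""")

# Rows whose key matches but whose other fields differ.
# Note: != treats NULL as unknown, so NULL-able columns need extra handling,
# and an inner join misses rows whose key exists in only one table.
diffs = conn.execute("""
    SELECT a.id, a.field1, b.field1, a.field2, b.field2
    FROM atable AS a
    JOIN btable AS b ON a.id = b.id
    WHERE a.field1 != b.field1 OR a.field2 != b.field2
""").fetchall()
print(diffs)  # the row with id 2 differs on field1
```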
You could try hashing the rows of table 1 with any hash function, then run through table 2 and check whether each row's hash is already present; any row whose hash is missing is a difference. Theoretically this should be the most efficient solution, I think.
You could use MinHash / LSH hash functions for scaling up.
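The hashing idea can be sketched as follows; this is an illustration only, with rows represented as in-memory tuples (in practice they would be streamed from the tables), and the serialization via repr is an assumption — any stable per-row encoding works:

```python
import hashlib

def row_digest(row):
    # Stable serialization of the row, then a compact digest.
    return hashlib.md5(repr(row).encode("utf-8")).hexdigest()

# Toy data standing in for the two tables.
table1 = [(1, 10, "same"), (2, 20, "changed")]
table2 = [(1, 10, "same"), (2, 99, "changed")]

# Pass 1: digest every row of table 1.
seen = {row_digest(r) for r in table1}

# Pass 2: any row of table 2 whose digest was never seen is a difference.
only_in_table2 = [r for r in table2 if row_digest(r) not in seen]
print(only_in_table2)  # [(2, 99, 'changed')]
```

Note this detects rows present in table 2 but not in table 1; running the two passes in the other direction as well catches rows missing from table 2.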
As mentioned by Jay, using a hash is a more effective and reliable solution than joining the tables (joins run into difficulties when keys are repeated, for instance).
You could have a look at this Python program, which handles such comparisons of Hive tables and shows the differences in a web page: https://github.com/bolcom/hive_compared_bq