简体   繁体   中英

How to find differences between the same table but from 2 databases(snowflake/sql)

So I have a table that is in Testing and Production and I want to check to see whether there is any discrepancy on "code_num" column. I need to do this to validate that the table has been migrated correctly.

Is there an easy way to do this through SQL

You should be able to access both prod and dev from same web interface. Depending on your settings, you can access prod data like select * from prod.myschema.mytable and test by select * from test.myschema.mytable .

Then you can easily compare both of them like below-

--code_num in test but not in prod
select * from test.myschema.mytable where code_num not in (select code_num from prod.myschema.mytable);
--code_num in prod but not in test
select * from prod.myschema.mytable where code_num not in (select code_num from test.myschema.mytable);

Snowflake SQL has the minus operator for doing something like this. The query below will have a positive count if there is data in the test table that is not in prod.

select count(*) cnt from (
    select code_num from "prod"."schema"."table" 
    minus 
    select code_num from "test"."schema"."table" 
);

The problem statement that you have mentioned is also referred as "Load Assurance" or "Audit, Balance and Control" or "Reconciliation". The solution for this is usually designed and developed as a framework. The framework is developed in such a way that it should be able to take in different reconciliation rules, execute those rules and then persist for visualization later.

Now, coming to the techniques to reconcile. Reconciliation between source and target in a Data Lake or DWH scenario becomes challenging due to the volume of data. It would be computationally cost prohibitive if you would like to do a row by row comparison. So, the technique that is usually used is based on aggregation. I have earlier developed multiple such solutions which calculates the below aggregations on both source and target and then compares the result to find differences in migration. The aggregation functions are as below

Numeric Columns: MAX(), MIN(), Count of Distinct Values, Count of Null and Count(*) Non-Numeric: Count of Distinct Values, Count of Null

For row comparison, take a sample of rows(ensuring that it represents the population) and then created a message digest(MD5) of the rows to compare between source and target

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM