More efficient way of comparing two tables in Redshift?

I have a series of stored procedures containing logic that populates tables. I edit the stored procedure logic to populate new fields in those tables. Currently, to check how a stored procedure change affects a table, I take a full copy of the table before the change and store it as a new table in the Redshift database, e.g. 'Table_test', so that I can check that the row counts are the same and that the columns contain the same data. This seems very inefficient: I'm storing the whole old table just to compare it against the new version of the table.
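To make this concrete, the current process is roughly the following (table names here are just placeholders):

-- Snapshot the table before editing the stored procedure
CREATE TABLE table_test AS SELECT * FROM my_table;

-- ... run the modified stored procedure, which repopulates my_table ...

-- Compare row counts and spot-check that the original columns still match
SELECT COUNT(*) FROM my_table;
SELECT COUNT(*) FROM table_test;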

Is there a better/more efficient way to compare two tables in AWS Redshift?

What I have done in the past for comparing data between databases is to create a per-column "MD5-like" signature. In your case you could do something similar on your "pre" table contents and your "post" table contents. This would only tell you which columns are different, but that may be all you need.

Debugging when there is a difference could be hard, but you could "save" a copy of the table to S3 for debug use. This could defeat the speed you are looking for, so you may only want to run this way when there is an issue or when turning on testing. You could also run such a process by "date" so that you can get the day and column that mismatch.
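As a sketch of the "save a copy to S3" idea (the bucket, prefix, and IAM role below are placeholders), an UNLOAD of the mismatching table when a difference is detected could look like:

-- Keep a debug copy outside Redshift only when a mismatch is found
UNLOAD ('SELECT * FROM my_table')
TO 's3://my-debug-bucket/table-compare/my_table_'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-unload-role'
FORMAT AS PARQUET;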

I've built such a signature several different ways, as non-Redshift databases aren't always as fast as Redshift. Since you are comparing Redshift to Redshift, the comparison process becomes easier and faster. What I'd do in this case is perform MD5(columnN::text) for every column, then convert a portion of the hex result to BIGINT. You can then SUM() these values for every column. (SUM() is the easiest way to aggregate the column information while using only a subset of each MD5 result.) Since MD5 signatures are large, using a subset of the result is fine, as the MD5 hash spreads the "uniqueness" across the whole result. Overflow can be an issue, so adding a negative constant to each value can help with this. The resulting query will look something like:

select
    -- take 8 hex digits of each MD5, convert to BIGINT, offset to avoid overflow, and sum
    sum(nvl(strtol(substring(md5({{column.column_name}}), 17, 8), 16) - (1::bigint << 31), 0))
from <<CTE>>;

This is from a jinja2 template I use for this process, which reads the table DDL and converts non-text columns to text in a CTE. Hopefully this snippet makes it clear how the process works.
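For illustration, a rendered version of this template for a hypothetical table orders(order_id BIGINT, status VARCHAR, amount DECIMAL) might look like the query below; the table and column names are invented for the example and are not part of the template itself:

-- CTE casts non-text columns to text, mirroring what the template generates
WITH t AS (
    SELECT
        order_id::text AS order_id,
        status         AS status,
        amount::text   AS amount
    FROM orders        -- run once against the "pre" copy and once against the "post" table
)
SELECT
    COUNT(*) AS row_cnt,
    SUM(NVL(STRTOL(SUBSTRING(MD5(order_id), 17, 8), 16) - (1::bigint << 31), 0)) AS order_id_sig,
    SUM(NVL(STRTOL(SUBSTRING(MD5(status),   17, 8), 16) - (1::bigint << 31), 0)) AS status_sig,
    SUM(NVL(STRTOL(SUBSTRING(MD5(amount),   17, 8), 16) - (1::bigint << 31), 0)) AS amount_sig
FROM t;

Running this once on the pre-change table (keeping only the single result row) and once on the repopulated table, then comparing the two rows, shows whether the row count or any column's contents changed, without keeping a full copy of the old table in Redshift.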
