Is switching from DB2 (en_US collation) to Snowflake (with default collation UTF-8) a good idea?

Question

At the company where I work, they are about to migrate from the legacy DB2 database to Snowflake.

Database Configuration for Database DWPROD
    Database territory                                      = US
    Database code page                                      = 819
    Database code set                                       = ISO8859-1
    LANG=en_US

The target database uses the default configuration, meaning the default UTF-8 collation. We already had to trim all text columns before loading the data into Snowflake, because trailing spaces were breaking some joins. (On the DB2 side, the collation took care of this.) I have now realized yet another, obvious, problem, this time with sorting:
Snowflake with UTF-8 sorts uppercase letters before lowercase letters (A-Z first, then a-z). DB2, on the other hand, sorts a, A before b, B, and so on.
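
The ordering difference can be reproduced outside either database. Here is a minimal Python sketch, assuming a small hypothetical word list. Plain `sorted` compares code points, which mirrors the uppercase-first ordering described above. The second key only approximates a linguistic collation (case-insensitive first, lowercase ahead of uppercase on ties); it is not DB2's actual collation algorithm.

```python
# Hypothetical sample values, not taken from the real data.
words = ["banana", "Apple", "apple", "Zebra", "Banana", "zebra"]

# Code-point order: every uppercase ASCII letter sorts before any lowercase one.
binary_order = sorted(words)

def linguistic_key(s):
    # Rough approximation of a linguistic order: compare case-insensitively,
    # then put the lowercase variant ahead of the uppercase one on ties.
    return (s.casefold(), s.swapcase())

linguistic_order = sorted(words, key=linguistic_key)

print(binary_order)      # uppercase words first, then lowercase words
print(linguistic_order)  # a/A grouped together, then b/B, then z/Z
```

Any report or export that relied on the second ordering would silently change under the first one.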

I'm trying to find more examples of what might go wrong, so I can present them and stop the madness.

I have already collected examples of the issues listed above. I am hoping for (dreaming of) answers from people with a lot of experience with collation and Unicode. Some might say this is basic stuff, but these days it looks like everybody ignores it. It would also be great if people shared stories of such migrations failing or needing to be redone.

Answer 1

It's important to know the limitations of using non-default collation on Snowflake:

https://docs.snowflake.com/en/sql-reference/collation.html#collation-limitations

For me personally, the limitation on UDFs is sufficient reason to avoid changing the default collation. Sometimes there is simply no substitute for a UDF, and when you need one and cannot use it with a non-default collation, that is a problem. The reduction of the maximum string size from 16 MB to 8 MB and the lack of support for collated strings in arrays, objects, and variants are also major considerations.

You can use trim() to handle leading/trailing spaces and ilike instead of like to handle case sensitivity. For sorting, you may need an extra upper- or lower-cased column, an age-old way to deal with case-sensitive comparisons in databases.
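
The normalization idea can be sketched outside SQL as well. The Python snippet below is only an illustration under assumptions: the rows and the helper `norm_key` are hypothetical, `rstrip` stands in for trimming trailing pad spaces, and `casefold` stands in for a case-insensitive comparison. In Snowflake the equivalent would be applying trim() and upper()/lower() to the columns.

```python
# Hypothetical rows: values with trailing pad spaces and mixed case.
customers = [("ACME  ", 1), ("Globex", 2)]
orders = [("acme", 100), ("GLOBEX   ", 200), ("Initech", 300)]

def norm_key(value):
    # Strip trailing spaces and ignore case so that equal-looking keys match.
    return value.rstrip(" ").casefold()

# Exact comparison finds no matches because of padding and case differences.
exact = [(c_id, amt) for c, c_id in customers for o, amt in orders if c == o]

# Comparing normalized keys recovers the expected matches.
by_key = {norm_key(name): c_id for name, c_id in customers}
normalized = [(by_key[norm_key(o)], amt) for o, amt in orders if norm_key(o) in by_key]

print(exact)
print(normalized)
```

Storing the normalized value as an extra column at load time corresponds to the upper/lower column approach: joins and sorts then use that column, while the original text stays available for display.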
