
AWS Redshift Migration

I learned how to code in SQL about 2 months ago, so I'm still pretty new and still learning different commands/functions each day. I have been tasked with migrating some queries from Teradata to Redshift, and there are obviously some syntax differences. I have been able to replace most of them, but I am stuck on a command, "SYS_CALENDAR". Can someone explain to me how SYS_CALENDAR works so I could potentially hard-code it, or does anyone know of a suitable replacement that runs within AWS Redshift?

Thanks

As someone who has ported a large Teradata solution to Redshift, let me say good luck. These are very different systems, and porting the SQL to achieve functional equivalence is only the first challenge. I'm happy to have an exchange on what these challenges will likely be if you like, but first, your question.

SYS_CALENDAR in Teradata is a system view (usually referenced as SYS_CALENDAR.CALENDAR) that can be used like a normal view and holds information about every date. It can be queried or joined as needed to get, for example, the day-of-week or week-of-year information about a date. It really performs a date calculation function based on OS information but is used like a view.
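As an illustration, a typical Teradata query joins a table to the view on the date column and reads the calendar attributes from it. This is only a sketch: the orders table and its order_id and order_date columns are hypothetical, while calendar_date, day_of_week and week_of_year are columns of the Teradata view.

```sql
-- Teradata: look up calendar attributes by joining to the system calendar view.
-- "orders", "order_id" and "order_date" are hypothetical names for illustration.
SELECT o.order_id,
       o.order_date,
       c.day_of_week,    -- 1 = Sunday ... 7 = Saturday in Teradata
       c.week_of_year
FROM   orders o
JOIN   SYS_CALENDAR.CALENDAR c
  ON   c.calendar_date = o.order_date;
```

Queries of this shape are the ones that need attention when porting, since the joined view has no counterpart in Redshift.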

No equivalent view exists in Redshift, and this creates some porting difficulties. Many people create a "DATES" table in Redshift to hold the information they need for dates across some range, and there are web pages on making such a table (e.g. https://elliotchance.medium.com/building-a-date-dimension-table-in-redshift-6474a7130658 ). Just pre-calculate all the date information you need for the range of dates in your database, and you can swap this table into queries when porting. This is the simplest route to take for porting and is the one that many choose (sometimes wrongly).
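A minimal sketch of such a table, assuming a range of 2000-01-01 through 2030-12-31, is shown here. The table name, the column names (chosen to resemble the Teradata view) and the range are all assumptions. The row sequence is built by cross-joining a ten-row digits list, which avoids relying on a row-generating function.

```sql
-- Redshift: build a small date dimension for a fixed range.
-- Table name, column names and the 2000-2030 range are assumptions.
CREATE TABLE dates AS
WITH digits AS (
    SELECT 0 AS d UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3
    UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6
    UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9
),
seq AS (
    -- integers 0 .. 99999, one per candidate day offset
    SELECT a.d + 10 * b.d + 100 * c.d + 1000 * e.d + 10000 * f.d AS n
    FROM   digits a
    CROSS JOIN digits b
    CROSS JOIN digits c
    CROSS JOIN digits e
    CROSS JOIN digits f
),
days AS (
    SELECT DATEADD(day, n, '2000-01-01'::date)::date AS calendar_date
    FROM   seq
    WHERE  n <= DATEDIFF(day, '2000-01-01'::date, '2030-12-31'::date)
)
SELECT calendar_date,
       DATE_PART(dow, calendar_date)::int + 1 AS day_of_week,      -- shifted so 1 = Sunday
       DATE_PART(day, calendar_date)::int     AS day_of_month,
       DATE_PART(doy, calendar_date)::int     AS day_of_year,
       DATE_PART(week, calendar_date)::int    AS iso_week_of_year, -- ISO 8601 week number
       DATE_PART(month, calendar_date)::int   AS month_of_year,
       DATE_PART(qtr, calendar_date)::int     AS quarter_of_year,
       DATE_PART(year, calendar_date)::int    AS year_of_calendar
FROM   days;
```

Week numbering conventions can differ between Teradata and Redshift, so it is worth comparing a few sample dates against the old system before relying on the week column.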

The issue with this route is that a user-maintained DATES table is often a time bomb waiting to go off and is technical debt for the solution. The table only has the dates you specify at creation, and the range of dates in the data often expands over time. When the table is used with a date that it doesn't contain (for example, an inner join simply drops the row), wrong answers are produced, data is corrupted, and the failure is usually silent. Not good. Some create processes to expand the date range, but again this is based on some "expectation" of how the table will be used. It is also a real table, with ever-expanding data, that is frequently used, which can cause query performance issues. It isn't really needed, so it is a performance tax for all time.
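One way to surface this failure rather than leave it silent is an anti-join check, sketched here. It assumes a dates table keyed on a calendar_date column and a hypothetical orders table with an order_date column.

```sql
-- Redshift: find rows whose date has no match in the DATES table.
-- An inner join to "dates" would drop these rows without raising any error.
-- "orders", "order_date", "dates" and "calendar_date" are assumed names.
SELECT o.order_date,
       COUNT(*) AS rows_without_calendar_match
FROM   orders o
LEFT JOIN dates d
       ON d.calendar_date = o.order_date
WHERE  d.calendar_date IS NULL
  AND  o.order_date IS NOT NULL
GROUP BY o.order_date
ORDER BY o.order_date;
```

An empty result means every date in the data is covered. Any rows returned are dates that queries joining to the table would quietly lose.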

The better long-term answer is to use the native Redshift (Postgres) date functions to operate on the dates as you need. Doing this uses the OS's understanding of dates (without bound) and does what Teradata does with the system view (calculate the needed information). For example, you can get the week number of a date by using the DATE_PART() function instead of joining with the SYS_CALENDAR view. This approach doesn't have the downsides of the DATES table but does come with a porting cost. The structure of the queries needs to change (remove joins and add functions), which takes more work and requires understanding of the original query. Unfortunately, time, work, and understanding are often in short supply when porting databases, which is why the DATES table approach is so often seen and lives forever as technical debt.
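A rough sketch of what that rewrite looks like, using a hypothetical orders table with an order_date column: the join to the calendar view disappears and each attribute is computed in place.

```sql
-- Redshift: compute calendar attributes in place, with no join and no date-range limit.
-- "orders", "order_id" and "order_date" are hypothetical names.
SELECT o.order_id,
       o.order_date,
       DATE_PART(dow, o.order_date)::int + 1 AS day_of_week,      -- dow is 0 = Sunday; +1 gives 1 = Sunday
       DATE_PART(week, o.order_date)::int    AS iso_week_of_year, -- ISO 8601 week number
       DATE_PART(doy, o.order_date)::int     AS day_of_year,
       DATE_PART(qtr, o.order_date)::int     AS quarter_of_year
FROM   orders o;
```

The + 1 on the day of week is there because Redshift counts from 0 = Sunday while the Teradata view counts from 1 = Sunday. Week numbers may also follow a different convention in the two systems, so check the results against the original query on sample dates before swapping it in.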

I assume that this port is large in nature, and if so my recommendation is this: lay out these trade-offs for the stakeholders. If they cannot absorb the time to convert the queries (likely), propose the DATES table approach, but have the technical debt clearly documented along with the "end date" at which functionality will break. I'd pick a somewhat close date, like 2025, so that some action will need to be on the long-term plans. Have triggers documented as to when action is needed.
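If the DATES table route is taken, one possible trigger is a scheduled check of how much coverage remains. This sketch assumes the table is named dates and has a calendar_date column. The threshold at which the result should raise an alarm is left to whoever documents the trigger.

```sql
-- Redshift: report how many days remain before the DATES table runs out.
-- "dates" and "calendar_date" are assumed names.
SELECT MAX(calendar_date)                               AS last_covered_date,
       DATEDIFF(day, CURRENT_DATE, MAX(calendar_date))  AS days_of_coverage_left
FROM   dates;
```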

This will not be the last of these "technical debt" issues to come up in a port such as this. There are too many places where "get it done" will trump "do it right". You haven't even scratched the surface on performance issues: these are very different databases, and data solutions tuned over time for Teradata will not perform optimally on Redshift after a simple port. This isn't an "all is lost" level issue. Just get the choices documented along with the long-term implications of those choices. Have triggers (dates or performance measures) defined for when aspects of the "port" will need to be followed up with an "optimization" effort. Management likes to forget about the need for follow-up on these efforts, so get these documented.


 