
Replication pipeline to replicate data from MySql RDS to Redshift

My problem is to create a replication pipeline that replicates tables and data from MySql RDS to Redshift, and I cannot use any managed service. Also, any new updates in RDS should be replicated to the Redshift tables as well.

After looking at many solutions, I came to an understanding of the following steps:

  1. Create flat files/CSVs from MySql RDS and save them in S3.
  2. Use Redshift's COPY command to load the data into staging tables and then save it to the main tables (see the sketch after this list).
  3. For the update part, every time I will push new CSVs to S3 and repeat step 2.
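
For reference, here is a minimal sketch of step 2, assuming hypothetical names (a main table my_table with primary key id, a staging table my_table_staging, an S3 prefix s3://my-bucket/exports/my_table/, and an IAM role that can read the bucket). It loads the CSV files into the staging table and then merges the rows into the main table:

    -- Staging table mirrors the structure of the main table
    CREATE TEMP TABLE my_table_staging (LIKE my_table);

    -- Load the exported CSV files from S3 into the staging table
    COPY my_table_staging
    FROM 's3://my-bucket/exports/my_table/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-load-role'
    FORMAT AS CSV
    IGNOREHEADER 1;

    -- Merge: remove the rows that are being replaced, then insert the new versions
    BEGIN;
    DELETE FROM my_table
    USING my_table_staging
    WHERE my_table.id = my_table_staging.id;

    INSERT INTO my_table
    SELECT * FROM my_table_staging;
    COMMIT;

With this delete-then-insert pattern, only rows whose keys appear in the new CSV are touched; the rest of the main table is left as it is.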

So, I just wanted to confirm whether the above approach is fine. Also, every time an update happens, will the old data be deleted completely and replaced by the new data, or is it possible to update only the necessary records? If yes, then how?

Any help will be really appreciated. Thanks in advance.

Yes, the above strategy is not just fine, it's good. I use it in a production system and it works great, though you have to be careful and craft this strategy to make sure it solves your use case effectively and efficiently.

Here are a few points on what I mean by effectively and efficiently:

  1. Make sure you have the most efficient way to identify the records to be pushed to Redshift, meaning identify the candidate records with queries optimized for CPU and memory usage.
  2. Make sure to send the identified records to Redshift in an optimized way, including data size optimization, so that they use minimum storage and network bandwidth. For example, compress the CSV files with gzip so that they take minimum space in S3 and save network bandwidth.
  3. Try to run the Redshift COPY queries so that they execute in parallel (see the sketch after this list).
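
To illustrate points 2 and 3, a sketch assuming the export is split into several gzip-compressed CSV parts listed in a hypothetical manifest.json; COPY then distributes the listed files across the cluster slices and loads them in parallel:

    -- manifest.json, uploaded to S3 next to the data files:
    -- {
    --   "entries": [
    --     {"url": "s3://my-bucket/exports/my_table/part-000.csv.gz", "mandatory": true},
    --     {"url": "s3://my-bucket/exports/my_table/part-001.csv.gz", "mandatory": true}
    --   ]
    -- }

    COPY my_table_staging
    FROM 's3://my-bucket/exports/my_table/manifest.json'
    IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-load-role'
    FORMAT AS CSV
    GZIP
    MANIFEST;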

Hope this will help.
