Row processing data from Redshift to Redshift

We are working on a requirement where we want to fetch incremental data from one Redshift cluster row by row, process it as required, and insert it into another Redshift cluster. We want to do this row-wise, not as a batch operation. For that, we are writing a generic service that does the row processing, so the flow is Redshift -> Service -> Redshift. To load the data, we will use INSERT queries. For performance, we will commit after each batch of rows rather than after every row. However, I am a bit worried about the performance of issuing many INSERT queries. Is there any other tool that does this? There are many ETL tools available, but they all do batch processing, and we want to process row-wise. Can someone please suggest an approach?

Based on experience, I can guarantee that your approach will not be efficient: Redshift is optimized for bulk loads with COPY, and issuing many individual INSERT statements is slow. Refer to this link for detailed best practices:

https://docs.aws.amazon.com/redshift/latest/dg/c_loading-data-best-practices.html

Instead, I would suggest the following:

  1. Write a Python script that unloads data from your source Redshift cluster to S3, using a query condition that filters the data as you require, i.e. based on some threshold such as a time or date. This operation should be fast, and you can schedule the script to run every minute or every couple of minutes, generating multiple files.

  2. You now basically have a continuous stream of files in S3, where the size of each file (the batch size) is controlled by how often the previous script runs.

  3. Next, set up a service that keeps polling S3 for new objects/files as they are created, processes them as needed, and puts each processed file in another bucket. Let's call that bucket B2.

  4. Set up another Python script/ETL step that remotely executes a COPY command on the target Redshift cluster to load the files from bucket B2.
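
The steps above can be sketched roughly as follows. This is only a minimal illustration, not a complete pipeline: the function names, table names, bucket prefixes, and IAM role ARN are all hypothetical, the files are assumed to be CSV with a header row, and running the SQL and moving files in and out of S3 is left to whatever client you already use. It shows how the UNLOAD statement (step 1) and the COPY statement (step 4) could be built, and how step 3 can still apply your logic one row at a time inside each file.

```python
import csv
import io
from datetime import datetime

TS_FORMAT = "%Y-%m-%d %H:%M:%S"


def build_unload_sql(table, ts_column, since, until, s3_prefix, iam_role):
    """Step 1: build an UNLOAD statement for rows in [since, until).

    All arguments are illustrative; they are not validated or escaped here.
    """
    start = since.strftime(TS_FORMAT)
    end = until.strftime(TS_FORMAT)
    # Quotes inside the UNLOAD query text must be written as two single quotes.
    inner = (
        f"SELECT * FROM {table} "
        f"WHERE {ts_column} >= ''{start}'' AND {ts_column} < ''{end}''"
    )
    return (
        f"UNLOAD ('{inner}') TO '{s3_prefix}' "
        f"IAM_ROLE '{iam_role}' FORMAT AS CSV HEADER"
    )


def process_csv(text, transform):
    """Step 3: apply `transform` to every row of one unloaded CSV file.

    `transform` receives a dict for a single row and returns a dict,
    so the business logic stays row-wise even though files are batches.
    """
    reader = csv.DictReader(io.StringIO(text))
    out = io.StringIO()
    writer = None
    for row in reader:
        new_row = transform(row)
        if writer is None:
            # Take the output columns from the first transformed row.
            writer = csv.DictWriter(
                out, fieldnames=list(new_row), lineterminator="\n"
            )
            writer.writeheader()
        writer.writerow(new_row)
    return out.getvalue()


def build_copy_sql(table, s3_prefix, iam_role):
    """Step 4: build a COPY statement that loads processed files from B2."""
    return (
        f"COPY {table} FROM '{s3_prefix}' "
        f"IAM_ROLE '{iam_role}' FORMAT AS CSV IGNOREHEADER 1"
    )
```

For example, `process_csv(file_text, lambda r: {**r, "name": r["name"].upper()})` would upper-case the hypothetical `name` column of each row before the file is written to B2. Keeping the per-row logic in a plain function like this lets you keep row-wise semantics while the load itself stays a bulk COPY.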

This is just an initial idea, though. You will need to refine and optimize this approach. Best of luck!

