How to perform backfilling in a Redshift to BigQuery migration?

I am using the BigQuery Data Transfer Service to migrate all data from Redshift to BigQuery.

After that, I want to perform backfilling for a specific time range if any data is missing, but I don't see any backfill option in the transfer job.

How can I achieve that in BigQuery?

Reading your question in light of your comments, I would proceed differently from what you describe. You reach the same goal, however. :)

Using your ETL pipeline, the first step would be to accumulate raw data in a data lake. Let's take a storage service like S3 to do so. For this ETL pipeline, S3 is your data sink. Note that your pipeline does nothing more than take raw data from the source and put it into S3. Also, the location in S3 should be under a folder timestamped by day (e.g. yyyymmdd) so that you can sort and consume your data along the time dimension, as sketched below.
Obviously the data considered here is ahead in time of what you already have in Redshift. It may also have a different structure from what you already put into Redshift, due to potential transformations you set in your initial pipeline. If you loaded raw data directly into Redshift, then just export that data into the same S3 bucket under the prefix legacy/*. (If it was transformed, then you have to add a second S3 data sink to your pipeline at this intermediary transformation step and keep the same S3 naming strategy.)
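A minimal sketch of that first step, assuming a Python pipeline with boto3; the bucket name, prefix and helper function are hypothetical:

```python
import datetime
import json

import boto3

s3 = boto3.client("s3")

def write_raw_to_datalake(records, bucket="my-datalake-raw"):
    """Dump one batch of raw records under a yyyymmdd prefix in S3."""
    now = datetime.datetime.utcnow()
    day = now.strftime("%Y%m%d")  # the timestamped folder used later for replays
    key = f"raw/{day}/batch-{now.strftime('%H%M%S%f')}.json"
    body = "\n".join(json.dumps(r) for r in records)  # newline-delimited JSON
    s3.put_object(Bucket=bucket, Key=key, Body=body.encode("utf-8"))
    return key
```

The legacy data already sitting in Redshift can be exported into the same bucket under legacy/* with a Redshift UNLOAD statement pointing at that prefix.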

Let's take a break to understand what we have. We filled an S3 bucket with raw data that we can now replay at will for a specific day, using a cron job or an orchestration tool such as Apache Airflow. Moreover, you can freely modify the content of each timestamped folder in case you missed data, and replay the downstream pipelines: that is the backfill you want.
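As an illustration, here is a minimal Airflow sketch of such a replayable daily pipeline; the DAG id and the body of the replay function are hypothetical, only the scheduling pattern matters:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def replay_day(ds_nodash, **_):
    """Re-process one day's folder, e.g. s3://my-datalake-raw/raw/<yyyymmdd>/."""
    prefix = f"raw/{ds_nodash}/"
    # ... read the raw files under this prefix, apply the transformations,
    # and push the result to the downstream sink (GCS / BigQuery).
    print(f"Replaying {prefix}")

with DAG(
    dag_id="s3_raw_daily_replay",
    start_date=datetime(2020, 9, 1),
    schedule_interval="@daily",
    catchup=True,  # lets Airflow backfill past days out of the box
) as dag:
    PythonOperator(task_id="replay_day", python_callable=replay_day)
```

A backfill for a given date range can then be triggered with something like `airflow dags backfill -s 2020-09-01 -e 2020-09-14 s3_raw_daily_replay`.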

Speaking of which, S3 would act as the data source for those downstream pipelines, which apply the wanted transformations to the raw data from S3 and use BigQuery (and potentially Redshift) as the data sink. Now please take into consideration the price of these operations. The streaming API in BQ is expensive, as high as $0.50 per GB. Do that only if you need real-time results. If you can afford a latency of more than 5 minutes, a better strategy would be to set GCS as the data sink of your ETL and transfer the data from there into BQ (keeping the same yyyymmdd file naming pattern to enable potential backfills). This transfer is free if the GCS bucket and the BQ dataset are in the same region. You could trigger the transfer with GCS events, for instance by triggering a Cloud Function on blob creation that loads the data into BQ.
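A minimal sketch of that event-driven load, assuming a first-generation Cloud Function triggered on object creation and the google-cloud-bigquery client; the dataset and table names are hypothetical:

```python
from google.cloud import bigquery

client = bigquery.Client()

def gcs_to_bq(event, context):
    """Background Cloud Function: load the newly created GCS blob into BigQuery."""
    uri = f"gs://{event['bucket']}/{event['name']}"
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
        autodetect=True,
    )
    load_job = client.load_table_from_uri(
        uri, "my_project.my_dataset.events", job_config=job_config
    )
    load_job.result()  # wait for the load job to finish
```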

Last but not least, backfilling should be done wisely, especially in BQ where updates or inserts at the row level are not performant and are an open door to duplication. What you should consider is BigQuery partitioning, which you can set on a column that contains a timestamp, or on a hidden one if your data contains none. Which timestamp? The one set in the GCS folder name. Once again, you can modify the data in your GCS bucket for a given day and replay the transfer into BQ, but each transfer for a given day must overwrite the partition the considered data belongs to (e.g. the data under 20200914 would overwrite the associated partition in BQ). We abide by the concept of a pure task in doing so, which is a guarantee of idempotency and non-duplication. Please read this article to have more insights.
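A minimal sketch of that idempotent replay, assuming a day-partitioned table and using a partition decorator (table$yyyymmdd) with WRITE_TRUNCATE so a day can be reloaded without duplicates; the project, dataset and bucket names are hypothetical:

```python
from google.cloud import bigquery

client = bigquery.Client()

def overwrite_partition(day: str, bucket: str = "my-gcs-datalake"):
    """Replay one day (yyyymmdd) from GCS, overwriting only that BQ partition."""
    uri = f"gs://{bucket}/raw/{day}/*.json"
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
        # WRITE_TRUNCATE on a partition decorator replaces only that partition,
        # so replaying the same day never duplicates rows.
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    )
    load_job = client.load_table_from_uri(
        uri, f"my_project.my_dataset.events${day}", job_config=job_config
    )
    load_job.result()

# e.g. overwrite_partition("20200914") replaces only the 2020-09-14 partition
```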

Note: If you intend to get rid of Redshift, you can choose to do it directly and forget about S3 as the data sink of your first ETL. Choose GCS directly (ingress is free) and migrate your already present Redshift data into GCS, using S3 as an intermediary service and the Google Storage Transfer Service from S3 to GCS.
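For that one-off migration, a sketch using the Storage Transfer Service Python client (google-cloud-storage-transfer); the bucket names and credentials are placeholders, and this follows the structure of Google's published samples rather than being the only way to do it:

```python
from google.cloud import storage_transfer

def s3_to_gcs_transfer(project_id, s3_bucket, gcs_bucket, aws_key_id, aws_secret):
    """Create and run a one-time Storage Transfer Service job from S3 to GCS."""
    client = storage_transfer.StorageTransferServiceClient()
    job = client.create_transfer_job(
        {
            "transfer_job": {
                "project_id": project_id,
                "status": storage_transfer.TransferJob.Status.ENABLED,
                "transfer_spec": {
                    "aws_s3_data_source": {
                        "bucket_name": s3_bucket,
                        "aws_access_key": {
                            "access_key_id": aws_key_id,
                            "secret_access_key": aws_secret,
                        },
                    },
                    "gcs_data_sink": {"bucket_name": gcs_bucket},
                },
            }
        }
    )
    # Kick off the transfer immediately instead of waiting for a schedule.
    client.run_transfer_job({"job_name": job.name, "project_id": project_id})
```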
