
How to automatically transfer newly added avro data from GCS to BigQuery

I want to schedule a data transfer job from Cloud Storage to BigQuery. I have an application that continuously dumps data to a GCS bucket path (say gs://test-bucket/data1/*.avro ), and I want that data moved to BigQuery as soon as each object is created in GCS.

I don't want to migrate all the files in the folder again and again; I only want to move the objects added since the last run.

The BigQuery Data Transfer Service accepts Avro files as input, but it does not take a folder, and it transfers all objects rather than only the newly added ones.

I am new to this, so I might be missing some functionality. How can I achieve it?

Please note: I want to schedule a job that loads data at a certain frequency (every 10 or 15 minutes). I don't want a trigger-based solution, since the number of objects generated will be huge.
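One way to sketch such a polling job (this is an illustration, not an official pattern — the bucket, dataset, and table names are placeholders, and how you persist the last-run timestamp between runs is left to you, e.g. Firestore or a state object in GCS): run it every 10–15 minutes from Cloud Scheduler, list the prefix, keep only objects created after the previous run, and submit them as a single load job.

```python
from datetime import datetime, timezone


def select_new_uris(objects, last_run):
    """Keep only .avro objects created strictly after the previous run.

    `objects` is an iterable of (name, created_datetime) pairs, e.g. built
    from google.cloud.storage list_blobs(); returns fully qualified GCS URIs.
    """
    return [
        f"gs://test-bucket/{name}"
        for name, created in objects
        if name.endswith(".avro") and created > last_run
    ]


def load_new_avro(last_run):
    """Hypothetical scheduled entry point (requires google-cloud-storage,
    google-cloud-bigquery, and credentials, hence the local imports)."""
    from google.cloud import bigquery, storage

    gcs = storage.Client()
    blobs = gcs.list_blobs("test-bucket", prefix="data1/")
    uris = select_new_uris(((b.name, b.time_created) for b in blobs), last_run)
    if not uris:
        return  # nothing new since the last run

    bq = bigquery.Client()
    job_config = bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.AVRO)
    # All new files go into one load job per run, which keeps the number of
    # load jobs per day low regardless of how many objects arrive.
    bq.load_table_from_uri(uris, "my_dataset.my_table", job_config=job_config).result()
```

Batching every new file into a single load job per run is what keeps this approach viable when the object count is huge.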

You can use a Cloud Function with a Cloud Storage event trigger: launch a Cloud Function that loads the data into BigQuery whenever a new file arrives. https://cloud.google.com/functions/docs/calling/storage EDIT: If you have more than 1,500 loads per table per day, you can work around the quota by writing the data with the BigQuery Storage Write API instead.
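A minimal sketch of that suggestion, assuming a function deployed on a `google.storage.object.finalize` trigger (the function, dataset, and table names are placeholders, and a real deployment would add error handling):

```python
def should_load(object_name):
    """Only Avro files under the watched prefix should trigger a load."""
    return object_name.startswith("data1/") and object_name.endswith(".avro")


def gcs_to_bigquery(event, context):
    """Cloud Function entry point; `event` carries the object metadata.

    Requires google-cloud-bigquery and credentials, hence the local import.
    """
    from google.cloud import bigquery

    if not should_load(event["name"]):
        return

    client = bigquery.Client()
    uri = f"gs://{event['bucket']}/{event['name']}"
    job_config = bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.AVRO)
    client.load_table_from_uri(uri, "my_dataset.my_table", job_config=job_config).result()
```

Note that this fires one load job per object, which is exactly why the 1,500-loads-per-table-per-day quota mentioned above can become a problem at high object counts.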

If you do not need top performance, you can just create an external table over that folder and query it instead of loading every file.
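For example, a BigQuery external table over the wildcard path could be created with DDL like this (the dataset and table names are assumptions):

```sql
CREATE EXTERNAL TABLE my_dataset.data1_external
OPTIONS (
  format = 'AVRO',
  uris = ['gs://test-bucket/data1/*.avro']
);
```

Queries against this table read the files in GCS directly, so newly added objects are picked up automatically, at the cost of slower queries than a native table.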
