Creating a BigQuery Transfer Service with a more complex regex

I have a bucket that stores files according to their transaction time, in a filename structure like

gs://my-bucket/YYYY/MM/DD/[autogeneratedID].parquet

Let's assume this structure dates back to 2015/01/01.

Some of the files might arrive late, so in theory a new file could be written to the 2020/07/27 structure tomorrow.

I now want to create a BigQuery table into which all files with a transaction date of 2019-07-01 or newer are loaded.

My current strategy is to slice the past into small enough chunks to just run batch loads, e.g. by month. Then I want to create a transfer service that listens for all new files coming in. I cannot just point it to gs://my-bucket/* as this would try to load the data prior to 2019-07-01 as well.
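As a sketch of one such monthly chunk, here is BigQuery's LOAD DATA statement (the bq load CLI command works equivalently). The table name mydataset.transactions is a hypothetical placeholder, and this relies on BigQuery's load wildcard matching across the / in object names (GCS object names are flat, unlike gsutil's path-style wildcards):

    -- Backfill one monthly chunk (here: July 2019); repeat per month.
    -- The single * spans the DD/[autogeneratedID] part of the object name.
    LOAD DATA INTO mydataset.transactions
    FROM FILES (
      format = 'PARQUET',
      uris = ['gs://my-bucket/2019/07/*.parquet']
    );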

So basically I was thinking about encoding the "future-looking" file name structures into a suitable regex, but it seems the wildcard names ( https://cloud.google.com/storage/docs/gsutil/addlhelp/WildcardNames ) only allow a very limited syntax that is not nearly as flexible as, for instance, awk regexes.

I know there are streaming inserts into BQ, but I am still hoping to avoid that extra complexity and just set up a smart transfer configuration to follow the batch loads.

You can use a scheduled query over an external table. When you query an external table, you can reference the pseudo column _FILE_NAME in the WHERE clause of your query and filter on it.
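A minimal sketch of that setup, assuming a hypothetical external table mydataset.ext_transactions (schema auto-detected from the Parquet files) and a hypothetical destination table mydataset.transactions:

    -- One-time setup: an external table over the whole bucket.
    CREATE EXTERNAL TABLE mydataset.ext_transactions
    OPTIONS (
      format = 'PARQUET',
      uris = ['gs://my-bucket/*.parquet']
    );

    -- Body of the scheduled query: parse the YYYY/MM/DD part of the
    -- object path out of _FILE_NAME and keep only files at or after
    -- the cutoff date.
    INSERT INTO mydataset.transactions
    SELECT *
    FROM mydataset.ext_transactions
    WHERE PARSE_DATE(
            '%Y/%m/%d',
            REGEXP_EXTRACT(_FILE_NAME, r'gs://my-bucket/(\d{4}/\d{2}/\d{2})/')
          ) >= DATE '2019-07-01';

Note that a plain INSERT like this re-reads every matching file on each scheduled run; to keep the job idempotent you would also need to exclude files already loaded, e.g. by tracking _FILE_NAME values in the destination table or tightening the date bound to the last successful run.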
