Load data stored on Google Cloud Storage with a multi-character delimiter into BigQuery

I want to load data with a multi-character delimiter into BigQuery. The BQ load command currently does not support multi-character delimiters; it only supports single-character delimiters such as '|', '$', '~', etc.

I know there is a Dataflow approach that reads data from those files and writes it to BigQuery. But I have a large number of small files (each around 400 MB) that have to be written to separate partitions of a table (around 700 partitions). This approach is slow with Dataflow because I currently have to start a separate Dataflow job for each file in a for loop, writing each file to its own partition. It has been running for more than 24 hours and is still not complete.

So is there another approach to load these files, with their multi-character delimiter, into the individual partitions of a BigQuery table?

From the Dataflow perspective, you can make this faster by loading multiple files in each pipeline. You can use a for loop in your main method while assembling the pipeline, so that a single job contains many parallel Read -> Write to BigQuery branches, as sketched below.

See also Strategy for loading data into BigQuery and Google Cloud Storage from local disk for more information.

My lazy approach to these problems: don't parse in Dataflow; just send each row to BigQuery raw, as a single column holding the whole line.

Then you can parse inside BigQuery with a JS UDF.
