
How to set file size instead of number of shards when writing from BigQuery to Cloud Storage in Dataflow

We're currently using Dataflow to read table data from BigQuery and write it to Cloud Storage with a fixed number of shards.

//Read Main Input
PCollection<TableRow> input = pipeline.apply("ReadTableInput",
    BigQueryIO.readTableRows().from("dataset.table"));

// process and write files
input.apply("ProcessRows", ParDo.of(new Process())
    .apply("WriteToFile", TextIO.write()
        .to(outputFile)
        .withHeader(HEADER)
        .withSuffix(".csv")
        .withNumShards(numShards));

To manage file size, we've estimated the total number of shards needed to keep each file under a certain size.
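Roughly, the estimate looks like this (a minimal sketch only; the target file size and the use of the BigQuery client's table metadata are illustrative assumptions, not our exact code):

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.Table;
import com.google.cloud.bigquery.TableId;

// Hypothetical shard estimate: read the table's byte count from the BigQuery API at
// pipeline-construction time and divide by a target file size. The stored table size
// is only a rough proxy for the CSV output size (encoding, header, no compression).
BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
Table table = bigquery.getTable(TableId.of("dataset", "table"));
long tableBytes = table.getNumBytes();

long targetBytesPerFile = 256L * 1024 * 1024;  // assumed ~256 MB per output file
int numShards = (int) Math.max(1, (tableBytes + targetBytesPerFile - 1) / targetBytesPerFile);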

Is there a way to specify a target file size instead of the number of shards, and let the number of shards be dynamic?

By design, it's not possible. If you dive into the core of Beam, you programmatically define an execution graph and then run it. The processing is massively parallel (ParDo means "Parallel Do"), on the same node or across several nodes/VMs.

Here the number of shards is simply the number of "writers" that work in parallel to write the files; the PCollection is then split across these writers.
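In other words, the closest you can get to "dynamic" shards is to leave the shard count to the runner rather than derive it from a size. A minimal sketch of the question's write step with the fixed shard count removed (the runner then picks the number of writers based on its parallelism, not on output size):

// Same write as in the question, but without withNumShards(...):
// the runner decides how many writers, and therefore how many files, to use.
input.apply("ProcessRows", ParDo.of(new Process()))
    .apply("WriteToFile", TextIO.write()
        .to(outputFile)
        .withHeader(HEADER)
        .withSuffix(".csv"));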

File size is highly variable (it depends on message size, text encoding, whether compression is applied and the compression ratio, and so on), so Beam can't rely on it when building its graph.

