
How does a Beam runner determine the size of each bundle of a PCollection?

According to the Apache Beam Execution Model - Bundling and persistence:

"Instead of processing all elements simultaneously, the elements in a PCollection are processed in bundles. The division of the collection into bundles is arbitrary and selected by the runner. This allows the runner to choose an appropriate middle-ground between persisting results after every element, and having to retry everything if there is a failure. For example, a streaming runner may prefer to process and commit small bundles, and a batch runner may prefer to process larger bundles."

The paragraph suggests that the size of a bundle is arbitrary and determined by the runner. I have checked out the Apache Beam source code and looked into the runners module to figure out how the runners determine the bundle size, but I couldn't figure it out.

Can someone point out in which Java class(es) or configuration file(s) the size of bundles is calculated? I am interested in how it works for the DirectRunner and the Cloud Dataflow Runner.

This generally isn't a knob meant to be set, and in fact it isn't an available knob in the open source code of the Dataflow runner harness or the Beam SDK themselves. Runners make a choice when packing a bundle, based on the runner's preferences and goals for running a performant pipeline.

In Dataflow, some of the closed-source backend systems determine this based on a variety of factors, including the sharding, how much data is available for a particular key, and the progress/throughput of the pipeline. The bundle size itself is not based on any static number, but is picked dynamically based on what is currently happening inside the pipeline and workers.
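
Although the bundle size isn't exposed as something you can read or set, you can observe the sizes a runner actually chooses by counting elements between @StartBundle and @FinishBundle in a DoFn. The sketch below is just an illustration of mine (the class name BundleSizeLoggingFn is not a Beam or Dataflow API) and assumes a PCollection of String:

  import org.apache.beam.sdk.transforms.DoFn;

  // Counts elements between @StartBundle and @FinishBundle so you can see how
  // large the bundles a runner hands to this DoFn actually are.
  public class BundleSizeLoggingFn extends DoFn<String, String> {
    private long elementsInBundle;

    @StartBundle
    public void startBundle() {
      elementsInBundle = 0;
    }

    @ProcessElement
    public void processElement(@Element String element, OutputReceiver<String> out) {
      elementsInBundle++;
      out.output(element);
    }

    @FinishBundle
    public void finishBundle() {
      // The DirectRunner tends to produce small bundles; Dataflow varies the
      // size dynamically as described above.
      System.out.println("Finished a bundle of " + elementsInBundle + " elements");
    }
  }

Apply it with ParDo.of(new BundleSizeLoggingFn()) at the point in the pipeline you are curious about and compare the logged counts between the DirectRunner and Dataflow.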

By looking into BigQueryIO you can see that they added a step that generates a shard number for each record:

  .apply("ShardTableWrites", ParDo.of(new GenerateShardedTable<>(numShards)))

After that, they apply Reshuffle and a GlobalWindow, which effectively groups the rows by shard id; each bundle then ends up being roughly the rows that were generated with the same shard id.

Tune the shard number to control how many rows end up under each shard id, and therefore how large the resulting bundles are.

I used this for BigTable reads and writes and it works perfectly.

full code:

  PCollectionTuple tuple =
    tagged
        .apply(Reshuffle.of())
        // Put in the global window to ensure that DynamicDestinations side inputs are accessed
        // correctly.
        .apply(
            "GlobalWindow",
            Window.<KV<ShardedKey<String>, TableRowInfo<ElementT>>>into(new GlobalWindows())
                .triggering(DefaultTrigger.of())
                .discardingFiredPanes())
        .apply(
            "StreamingWrite",
            ParDo.of(
                    new StreamingWriteFn<>(
                        bigQueryServices,
                        retryPolicy,
                        failedInsertsTag,
                        errorContainer,
                        skipInvalidRows,
                        ignoreUnknownValues,
                        ignoreInsertIds,
                        toTableRow))
                .withOutputTags(mainOutputTag, TupleTagList.of(failedInsertsTag)));
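
As a rough, generic sketch of the same pattern for another sink such as BigTable: assign each element a random shard id, Reshuffle on that key, then drop the key again. The class ReshardForLargerBundles and its numShards parameter are illustrative names of mine, not part of BigQueryIO:

  import java.util.concurrent.ThreadLocalRandom;
  import org.apache.beam.sdk.transforms.DoFn;
  import org.apache.beam.sdk.transforms.ParDo;
  import org.apache.beam.sdk.transforms.Reshuffle;
  import org.apache.beam.sdk.transforms.Values;
  import org.apache.beam.sdk.values.KV;
  import org.apache.beam.sdk.values.PCollection;

  public class ReshardForLargerBundles {

    // Key each element by a random shard id in [0, numShards), Reshuffle so the
    // runner checkpoints and regroups the data by that key, then drop the key.
    // numShards becomes a coarse lever over how the data is split up downstream.
    public static PCollection<String> resharded(PCollection<String> input, final int numShards) {
      return input
          .apply("AssignShard", ParDo.of(new DoFn<String, KV<Integer, String>>() {
            @ProcessElement
            public void processElement(@Element String element, OutputReceiver<KV<Integer, String>> out) {
              out.output(KV.of(ThreadLocalRandom.current().nextInt(numShards), element));
            }
          }))
          .apply("Reshuffle", Reshuffle.<Integer, String>of())
          .apply("DropShardKey", Values.<String>create());
    }
  }

Keep in mind that a runner is still free to bundle the reshuffled data however it likes; this only influences how the data is keyed and redistributed at the shuffle.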
