Apache Beam GenerateSequence does not emit elements at specified rates

I am experimenting with Apache Beam, and tried to use the GenerateSequence PTransform as a simple way to generate an unbounded, streaming data source.

The GenerateSequence class provides the method withRate(long numElements, Duration periodLength), which, to my understanding, controls how many elements are produced per period and how long each period lasts. To my surprise, the rate at which elements were actually produced did not match this description.

For example, I tried to use the following snippet of code:

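// Imports needed by this snippet (Beam Java SDK and Joda-Time)
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.GenerateSequence;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.DateTimeZone;
import org.joda.time.Duration;
import org.joda.time.Instant;
import org.joda.time.format.DateTimeFormat;
import org.joda.time.format.DateTimeFormatter;

Pipeline p = Pipeline.create(pipelineOptions);
Duration runtimeDuration = Duration.standardSeconds(20L);
Duration periodDuration = Duration.standardSeconds(1L);
PCollection<String> generated_seq = p.apply("Get Sequence",
        GenerateSequence.from(1)
                .withMaxReadTime(runtimeDuration)
                .withRate(1, periodDuration))
        .apply("Test sequence generation", ParDo.of(new DoFn<Long,String>() {
            @ProcessElement
            public void processElement(@Element Long in, OutputReceiver<String> out){
                long userId = in % 5; //simulate events from 5 users
                Instant timestamp = Instant.now(); // wall-clock time at processing, not the source's timestamp
                DateTimeFormatter fmt = DateTimeFormat.forPattern("yyyy-MM-dd HH-mm-ss.SSSzZ").withZone(DateTimeZone.forID("Etc/GMT"));
                System.out.println(in + " => UserId:" + userId + "|Timestamp: " + timestamp.toString(fmt));
                out.outputWithTimestamp(Long.toString(userId), timestamp);
            }
        }));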

The pipeline generated the resulting sequence:

5 => UserId:0|Timestamp: 2021-03-08 21-20-25.361GMT+0000
14 => UserId:4|Timestamp: 2021-03-08 21-20-25.446GMT+0000
6 => UserId:1|Timestamp: 2021-03-08 21-20-25.450GMT+0000
12 => UserId:2|Timestamp: 2021-03-08 21-20-25.452GMT+0000
7 => UserId:2|Timestamp: 2021-03-08 21-20-25.456GMT+0000
9 => UserId:4|Timestamp: 2021-03-08 21-20-25.459GMT+0000
13 => UserId:3|Timestamp: 2021-03-08 21-20-25.461GMT+0000
1 => UserId:1|Timestamp: 2021-03-08 21-20-25.463GMT+0000
2 => UserId:2|Timestamp: 2021-03-08 21-20-25.465GMT+0000
16 => UserId:1|Timestamp: 2021-03-08 21-20-25.468GMT+0000
10 => UserId:0|Timestamp: 2021-03-08 21-20-25.469GMT+0000
8 => UserId:3|Timestamp: 2021-03-08 21-20-25.471GMT+0000
15 => UserId:0|Timestamp: 2021-03-08 21-20-25.474GMT+0000
4 => UserId:4|Timestamp: 2021-03-08 21-20-25.476GMT+0000
17 => UserId:2|Timestamp: 2021-03-08 21-20-25.478GMT+0000
11 => UserId:1|Timestamp: 2021-03-08 21-20-25.488GMT+0000
3 => UserId:3|Timestamp: 2021-03-08 21-20-37.613GMT+0000

As shown above, most of the elements were generated within the same second even though I specified withRate(1, periodDuration), which should mean that at most 1 element is generated per 1-second period.

I dug into the SDK code to try to understand this behavior, but I could not identify its cause. Is there a way to resolve this issue, or is there a similar PTransform that can emulate an unbounded, streaming source?

Cause

This is likely caused by a quirk of the GenerateSequence transform that the documentation does not really explain. The underlying source that generates the numbers (CountingSource) works like this: when it has no elements due for emission, there is a brief wait before the source is checked again. If that wait is longer than the period duration, several elements may be due by the time the source is next checked, and the source will advance through them rapidly.

So in your example, what is likely happening is that the source starts and does not emit any elements yet, because the period of one second hasn't passed. It is checked again several seconds later, at which point it rapidly emits all the elements that should have been emitted in the meantime, until it runs out and waits again. You can see this with the last element in your output, which was emitted 12 seconds after the previous one. Increasing the runtime duration is a good way to see this in action; you will likely see multiple batches of elements being emitted.
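To make the batching effect concrete, here is a toy model in plain Java. It is only a sketch under simplifying assumptions, not the SDK's CountingSource code: the class name, the method name and the 12-second poll interval are all hypothetical, and "element n is due once n periods have elapsed" is a deliberate simplification.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model (not Beam code) of a rate-limited source that is only checked at fixed intervals. */
class RateLimitedPollingDemo {

    /**
     * Returns how many elements become available at each poll.
     * Assumption: one element becomes due per elapsed period of wall-clock time.
     */
    static List<Long> elementsPerPoll(long periodMillis, long pollIntervalMillis, long totalMillis) {
        List<Long> batches = new ArrayList<>();
        long emitted = 0;
        for (long now = pollIntervalMillis; now <= totalMillis; now += pollIntervalMillis) {
            long due = now / periodMillis; // elements that should exist by 'now'
            batches.add(due - emitted);    // everything overdue is emitted in one burst
            emitted = due;
        }
        return batches;
    }

    public static void main(String[] args) {
        // One element per second, but the source is only checked every 12 seconds.
        System.out.println(elementsPerPoll(1000, 12000, 36000)); // prints [12, 12, 12]
        // Checked once per second: one element per poll.
        System.out.println(elementsPerPoll(1000, 1000, 5000));   // prints [1, 1, 1, 1, 1]
    }
}
```

The average rate is the same in both runs; only the spacing between elements differs. That matches the idea that the configured rate bounds how many elements are due over time, rather than guaranteeing evenly spaced emission.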

Timestamps

The behavior described above is fine if all you need is raw numbers generated periodically. However, if you are using GenerateSequence to test a pipeline that depends on timestamps, you will want to supply a custom TimestampFn (via withTimestampFn) to set the timestamp of each emitted element. The default TimestampFn in the source may be a good example to follow. A very simple option that might work for you is to set the timestamp to match the value of the element.
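As a rough sketch of such a timestamp function, the helper below derives an event time from the element value alone, so the timestamps stay evenly spaced even when the elements arrive in bursts. It is plain Java with no Beam dependency; the class name, the method name and the base instant are illustrative assumptions, not part of the SDK.

```java
/** Deterministic event-time helper; the names here are illustrative and not part of Beam. */
class SyntheticEventTime {

    /** Event time in epoch millis for an element: 'element' periods after startMillis. */
    static long timestampMillisFor(long element, long startMillis, long periodMillis) {
        return startMillis + element * periodMillis;
    }

    public static void main(String[] args) {
        long start = 1_615_238_425_000L; // arbitrary base instant in epoch millis
        for (long n = 1; n <= 3; n++) {
            // Consecutive elements are exactly one period (1000 ms) apart.
            System.out.println(n + " -> " + timestampMillisFor(n, start, 1000L));
        }
    }
}
```

With Beam on the classpath, this could presumably be wired in as GenerateSequence.from(1).withRate(1, periodDuration).withTimestampFn(n -> new Instant(SyntheticEventTime.timestampMillisFor(n, startMillis, 1000L))), where startMillis is a value you choose. Note that the DoFn in the question re-stamps every element with Instant.now() through outputWithTimestamp; to keep the source-assigned timestamps, it would need to call out.output(...) instead, which preserves the timestamp of the input element.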
