
How to remove duplicates in sliding window - Apache Beam

I have implemented a data pipeline with multiple unbounded sources and side inputs. It joins the data using a sliding window (30 seconds long, starting every 10 seconds) and emits the transformed output to a Kafka topic. The issue is that data received in the first 10 seconds of a window is emitted 3 times: it is emitted again each time a new window starts, until the first window completes. How can I emit the transformed data only once, or otherwise avoid duplicates?

I have used discardingFiredPanes() and it makes no difference. Whenever I try setting the window closing behavior to FIRE_ALWAYS or FIRE_IF_NON_EMPTY, it throws the error below.

Exception in thread "main" org.apache.beam.sdk.Pipeline$PipelineExecutionException: java.lang.IllegalArgumentException: Empty PCollection accessed as a singleton view. Consider setting withDefault to provide a default value
    at org.apache.beam.runners.direct.DirectRunner$DirectPipelineResult.waitUntilFinish(DirectRunner.java:332)
    at org.apache.beam.runners.direct.DirectRunner$DirectPipelineResult.waitUntilFinish(DirectRunner.java:302)
    at org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:197)
    at org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:64)
    at org.apache.beam.sdk.Pipeline.run(Pipeline.java:313)
    at org.apache.beam.sdk.Pipeline.run(Pipeline.java:299)
    at y.yyy.main(yyy.java:86)
Caused by: java.lang.IllegalArgumentException: Empty PCollection accessed as a singleton view. Consider setting withDefault to provide a default value
    at org.apache.beam.sdk.transforms.View$SingletonCombineFn.identity(View.java:378)
    at org.apache.beam.sdk.transforms.Combine$BinaryCombineFn.extractOutput(Combine.java:481)
    at org.apache.beam.sdk.transforms.Combine$BinaryCombineFn.extractOutput(Combine.java:429)
    at org.apache.beam.sdk.transforms.Combine$CombineFn.apply(Combine.java:387)
    at org.apache.beam.sdk.transforms.Combine$GroupedValues$1.processElement(Combine.java:2089)

// Parse each comma-separated record into a Row, then assign sliding windows.
// `sch` is the Row schema; it is not shown in this snippet.
data.apply("Transform", ParDo.of(
  new DoFn<String, Row>() {

    private static final long serialVersionUID = 1L;

    @ProcessElement
    public void processElement(
      ProcessContext processContext,
      final OutputReceiver<Row> emitter) {

        String record = processContext.element();
        final String[] parts = record.split(",");
        emitter.output(Row.withSchema(sch).addValues(parts).build());
    }
  })).apply(
    "window1",
    Window
      .<Row>into(
        // 30-second windows, with a new window starting every 10 seconds
        SlidingWindows
          .of(Duration.standardSeconds(30))
          .every(Duration.standardSeconds(10)))
      .withAllowedLateness(
        Duration.ZERO,
        Window.ClosingBehavior.FIRE_IF_NON_EMPTY)
      .discardingFiredPanes());

Kindly guide me on how to trigger the window only once; i.e., I don't want to send records that have already been processed.

Update: The above side-input error occurs frequently, and it is not caused by windowing; it seems to be an issue in Apache Beam ( https://issues.apache.org/jira/browse/BEAM-6086 ).

I tried using state to identify whether a row has already been processed, but the state is not retained or never gets set; i.e., I always get null when reading the state.

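import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.state.StateSpec;
import org.apache.beam.sdk.state.StateSpecs;
import org.apache.beam.sdk.state.ValueState;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;

public class CheckState extends DoFn<KV<String,String>,KV<Integer,String>> {
  private static final long serialVersionUID = 1L;

  // State cell that stores the last key written
  @StateId("count")
  private final StateSpec<ValueState<String>> countState =
      StateSpecs.value(StringUtf8Coder.of());

  @ProcessElement
  public void processElement(
    ProcessContext processContext,
    @StateId("count") ValueState<String> countState) {

      KV<String,String> record = processContext.element();
      String row = record.getValue();
      // read() returns null if nothing has been written to this cell yet
      System.out.println("State: " + countState.read());
      System.out.println("Setting state as " + record.getKey() + " for value " + row.split(",")[0]);
      // NOTE: `current` is not defined anywhere in the snippet as posted
      processContext.output(KV.of(current, row));
      countState.write(record.getKey());
  }
}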
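
A possible factor in the always-null reads, offered as an assumption since the post does not show where CheckState is applied: Beam scopes state per key and per window. If the stateful DoFn runs after the sliding Window transform, each of the three window copies of an element gets its own, initially empty, state cell. The plain-Java sketch below (no Beam dependency; the class and method names are hypothetical) simulates that scoping with a map keyed by key and window start.

```java
import java.util.HashMap;
import java.util.Map;

/** Simulation only: mimics state that is scoped per (key, window). */
final class PerWindowStateDemo {

    // One simulated state cell per "key@windowStart" combination.
    private final Map<String, String> cells = new HashMap<>();

    /** Returns the value previously stored for (key, window), then stores the new one. */
    String readThenWrite(String key, long windowStartMs, String value) {
        // Map.put returns the previous value, or null if the cell was empty.
        return cells.put(key + "@" + windowStartMs, value);
    }

    public static void main(String[] args) {
        PerWindowStateDemo demo = new PerWindowStateDemo();
        // The same element, seen once in each of its three sliding windows:
        // every window has its own cell, so every first read is null.
        for (long start : new long[] {30_000L, 20_000L, 10_000L}) {
            System.out.println("window " + start + " -> previous state: "
                + demo.readThenWrite("k1", start, "k1"));
        }
        // A later element with the same key in the same window does see the write.
        System.out.println("same window again -> previous state: "
            + demo.readThenWrite("k1", 30_000L, "k1"));
    }
}
```

Whether this explains the behaviour in the post depends on where the stateful DoFn sits relative to the windowing step, which the snippet does not show.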

If I have understood the issue correctly, it may be related to the use of sliding windows in the pipeline:

Sliding time windows overlap. There is a nice explanation in the Beam guide's section on Window Functions:

"Because multiple windows overlap, most elements in a data set will belong to more than one window. This kind of windowing is useful for taking running averages of data; ..."

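To make the overlap concrete, here is a minimal plain-Java sketch of the assignment arithmetic for the 30-second/10-second configuration in the question. It is an illustration written for this post, not Beam's own implementation; the class and method names are hypothetical, and it assumes windows start at multiples of the period.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustration only: which sliding windows contain a given timestamp. */
final class SlidingWindowDemo {

    /**
     * Returns the start times (in ms) of every window [start, start + sizeMs)
     * that contains timestampMs, assuming windows start at multiples of periodMs.
     */
    static List<Long> windowStartsFor(long timestampMs, long sizeMs, long periodMs) {
        List<Long> starts = new ArrayList<>();
        // Latest window start at or before the timestamp.
        long lastStart = timestampMs - Math.floorMod(timestampMs, periodMs);
        // Walk backwards while the window still covers the timestamp.
        for (long start = lastStart; start > timestampMs - sizeMs; start -= periodMs) {
            starts.add(start);
        }
        return starts;
    }

    public static void main(String[] args) {
        long size = 30_000L;   // 30-second windows, as in the question
        long period = 10_000L; // a new window every 10 seconds
        // An element at t = 35 s falls into the windows starting at 30 s, 20 s and 10 s.
        System.out.println(windowStartsFor(35_000L, size, period));
    }
}
```

With a size of 30 seconds and a period of 10 seconds, every timestamp lands in size / period = 3 windows, which matches the three emissions described in the question.
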
Fixed windows, however, do not overlap:

"A fixed time window represents a consistent duration, non overlapping time interval in the data stream.."

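For contrast, the same kind of plain-Java sketch for fixed windows (again hypothetical names, not Beam code, assuming windows aligned to multiples of the window size) shows that each timestamp maps to exactly one window. In Beam, the corresponding window function would be FixedWindows.of(Duration.standardSeconds(10)).

```java
/** Illustration only: the single fixed window that contains a given timestamp. */
final class FixedWindowDemo {

    /** Returns the start (in ms) of the one window [start, start + sizeMs) containing timestampMs. */
    static long windowStartFor(long timestampMs, long sizeMs) {
        return timestampMs - Math.floorMod(timestampMs, sizeMs);
    }

    public static void main(String[] args) {
        // With 10-second fixed windows, an element at t = 35 s belongs only to [30 s, 40 s).
        System.out.println(windowStartFor(35_000L, 10_000L));
    }
}
```

Note the trade-off: with fixed windows each element is emitted once, but a join then only sees the data inside that single window rather than the trailing 30 seconds.
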