
Flink deduplication and processWindowFunction

I'm creating a pipeline whose inputs are JSON messages containing a timestamp field, which is used to set the event time. The problem is that some records can arrive at the system late or duplicated, and these situations need to be managed; to avoid duplicates I tried the following solution:

                .assignTimestampsAndWatermarks(new RecordWatermark()
                        .withTimestampAssigner(new ExtractRecordTimestamp()))
                .keyBy(new MetricGrouper())
                .window(TumblingEventTimeWindows.of(Time.seconds(60)))
                .trigger(ContinuousEventTimeTrigger.of(Time.seconds(3)))
                .process(new WindowedFilter())
                .keyBy(new MetricGrouper())
                .window(TumblingEventTimeWindows.of(Time.seconds(180)))
                .trigger(ContinuousEventTimeTrigger.of(Time.seconds(15)))
                .process(new WindowedCountDistinct())
                .map((value) -> value.toString());

where the first windowing operation filters the records based on timestamps saved in a set, as follows:

public class WindowedFilter extends ProcessWindowFunction<MetricObject, MetricObject, String, TimeWindow> {
    HashSet<Long> previousRecordTimestamps = new HashSet<>();

    @Override
    public void process(String s, Context context, Iterable<MetricObject> inputs, Collector<MetricObject> out) throws Exception {
        String windowStart = DateTimeFormatter.ISO_INSTANT.format(Instant.ofEpochMilli(context.window().getStart()));
        String windowEnd = DateTimeFormatter.ISO_INSTANT.format(Instant.ofEpochMilli(context.window().getEnd()));
        log.info("window start: '{}', window end: '{}'", windowStart, windowEnd);

        Long watermark = context.currentWatermark();
        log.info(inputs.toString());
        for (MetricObject in : inputs) {
            Long recordTimestamp = in.getTimestamp().toEpochMilli();
            if (!previousRecordTimestamps.contains(recordTimestamp)) {
                log.info("timestamp not contained");
                previousRecordTimestamps.add(recordTimestamp);
                out.collect(in);
            }
        }
    }
}

This solution works, but I have the feeling that I'm not considering something important, or that it could be done in a better way.

One potential problem with using windows for deduplication is that the windows implemented in Flink's DataStream API are always aligned to the epoch. This means that, for example, an event occurring at 11:59:59 and a duplicate occurring at 12:00:01 will be placed into different minute-long windows.
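As a minimal sketch of that alignment (the class name and timestamps are illustrative; TimeWindow.getWindowStartWithOffset is the same utility TumblingEventTimeWindows uses internally to assign an element to its window):

import java.time.Instant;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;

public class WindowAlignmentDemo {
    public static void main(String[] args) {
        long size = 60_000L; // one-minute tumbling windows, aligned to the epoch

        long t1 = Instant.parse("2021-01-01T11:59:59Z").toEpochMilli();
        long t2 = Instant.parse("2021-01-01T12:00:01Z").toEpochMilli();

        // prints 2021-01-01T11:59:00Z -- the window [11:59:00, 12:00:00)
        System.out.println(Instant.ofEpochMilli(
                TimeWindow.getWindowStartWithOffset(t1, 0, size)));
        // prints 2021-01-01T12:00:00Z -- the window [12:00:00, 12:01:00)
        System.out.println(Instant.ofEpochMilli(
                TimeWindow.getWindowStartWithOffset(t2, 0, size)));
    }
}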

However, in your case it appears that the duplicates you are concerned about also carry the same timestamp. In that case, what you're doing will produce correct results, so long as you're not concerned about the watermarking producing late events.
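If late events do matter, one hedged option is to capture them with a side output instead of letting the window drop them silently. This sketch reuses the operators from your pipeline; stream and lateTag are illustrative names:

final OutputTag<MetricObject> lateTag = new OutputTag<MetricObject>("late") {};

SingleOutputStreamOperator<MetricObject> filtered = stream
        .keyBy(new MetricGrouper())
        .window(TumblingEventTimeWindows.of(Time.seconds(60)))
        .sideOutputLateData(lateTag)   // events behind the watermark go here
        .trigger(ContinuousEventTimeTrigger.of(Time.seconds(3)))
        .process(new WindowedFilter());

DataStream<MetricObject> lateEvents = filtered.getSideOutput(lateTag);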

The other issue with using windows for deduplication is the latency they impose on the pipeline, and the workarounds used to minimize that latency (such as the ContinuousEventTimeTrigger in your pipeline, which fires early every few seconds rather than waiting for the window to close).

This is why I prefer to implement deduplication with a RichFlatMapFunction or a KeyedProcessFunction. Something like this will perform better than a window:

private static class Event {
  public final String key;

  public Event(String key) {
    this.key = key;
  }
}

public static void main(String[] args) throws Exception {
  StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
  
  env.addSource(new EventSource())
    .keyBy(e -> e.key)
    .flatMap(new Deduplicate())
    .print();
  
  env.execute();
}

public static class Deduplicate extends RichFlatMapFunction<Event, Event> {
  ValueState<Boolean> seen; // keyed state: non-null iff this key was seen recently

  @Override
  public void open(Configuration conf) {
    // expire the "seen" flag one minute after it is written,
    // so the state does not grow without bound
    StateTtlConfig ttlConfig = StateTtlConfig
      .newBuilder(Time.minutes(1))
      .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)
      .cleanupFullSnapshot()
      .build();
    ValueStateDescriptor<Boolean> desc = new ValueStateDescriptor<>("seen", Types.BOOLEAN);
    desc.enableTimeToLive(ttlConfig);
    seen = getRuntimeContext().getState(desc);
  }

  @Override
  public void flatMap(Event event, Collector<Event> out) throws Exception {
    // emit only the first event seen for each key within the TTL window
    if (seen.value() == null) {
      out.collect(event);
      seen.update(true);
    }
  }
}

Here the stream is being deduplicated by key, and the state involved is automatically cleared one minute after it was written. A KeyedProcessFunction variant is sketched below.
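For completeness, here is a hedged sketch of the KeyedProcessFunction alternative mentioned above (the class name DeduplicateWithTimer is illustrative, and it assumes event-time timestamps have been assigned upstream). Instead of relying on state TTL, it clears the keyed state explicitly with an event-time timer:

public static class DeduplicateWithTimer extends KeyedProcessFunction<String, Event, Event> {
  ValueState<Boolean> seen;

  @Override
  public void open(Configuration conf) {
    seen = getRuntimeContext().getState(new ValueStateDescriptor<>("seen", Types.BOOLEAN));
  }

  @Override
  public void processElement(Event event, Context ctx, Collector<Event> out) throws Exception {
    if (seen.value() == null) {
      out.collect(event);
      seen.update(true);
      // forget this key one minute (in event time) after it is first seen;
      // ctx.timestamp() is non-null only if timestamps have been assigned
      ctx.timerService().registerEventTimeTimer(ctx.timestamp() + 60_000L);
    }
  }

  @Override
  public void onTimer(long timestamp, OnTimerContext ctx, Collector<Event> out) {
    seen.clear();
  }
}

It would slot into the same pipeline by replacing flatMap(new Deduplicate()) with process(new DeduplicateWithTimer()).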
