Apache Flink 0.10: how to get the first occurrence of a composite key from an unbounded input data stream?

Question

I am a newbie with Apache Flink. I have an unbounded data stream as my input (fed into Flink 0.10 via Kafka).

I want to get the first occurrence of each primary key (the primary key is the combination of contract_num and event_dt).
These "duplicates" occur nearly immediately after each other. The source system cannot filter them out for me, so Flink has to do it.

Here is my input data:

contract_num, event_dt, attr 
A1, 2016-02-24 10:25:08, X
A1, 2016-02-24 10:25:08, Y
A1, 2016-02-24 10:25:09, Z
A2, 2016-02-24 10:25:10, C

Here is the output data I want:

A1, 2016-02-24 10:25:08, X
A1, 2016-02-24 10:25:09, Z
A2, 2016-02-24 10:25:10, C

Note that the 2nd row has been removed, because the key combination of A1 and '2016-02-24 10:25:08' already occurred in the 1st row.

How can I do this with Flink 0.10?

I was thinking about using keyBy(0,1), but after that I don't know what to do!

(I used joda-time and org.flinkspector to set up these tests.)

@Test
public void test() {
    DateTime threeSecondsAgo = (new DateTime()).minusSeconds(3);
    DateTime twoSecondsAgo = (new DateTime()).minusSeconds(2);
    DateTime oneSecondsAgo = (new DateTime()).minusSeconds(1);

    DataStream<Tuple3<String, Date, String>> testStream =
            createTimedTestStreamWith(
                    Tuple3.of("A1", threeSecondsAgo.toDate(), "X"))
            .emit(Tuple3.of("A1", threeSecondsAgo.toDate(), "Y"), after(0, TimeUnit.NANOSECONDS))
            .emit(Tuple3.of("A1", twoSecondsAgo.toDate(), "Z"), after(0, TimeUnit.NANOSECONDS))
            .emit(Tuple3.of("A2", oneSecondsAgo.toDate(), "C"), after(0, TimeUnit.NANOSECONDS))
            .close();
    
    testStream.keyBy(0,1);
}

Answer 1

Filtering duplicates over an infinite stream will eventually fail if your key space is larger than your available storage space. The reason is that you have to store the already seen keys somewhere in order to filter out the duplicates. Thus, it would be good to define a time window after which you can purge the current set of seen keys.
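To make the purging idea concrete, here is a minimal plain-Java sketch. It is not Flink API: the class name ExpiringSeenKeys and the ttlMillis parameter are made up for illustration, and it assumes that the timestamps passed in never decrease.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper, not part of Flink: remembers keys for a limited time only. */
class ExpiringSeenKeys<K> {
    // Insertion-ordered, so the oldest entries are always at the front
    // (assumption: callers pass non-decreasing timestamps).
    private final Map<K, Long> firstSeenAt = new LinkedHashMap<>();
    private final long ttlMillis;

    ExpiringSeenKeys(long ttlMillis) {
        this.ttlMillis = ttlMillis;
    }

    /** Returns true if the key has not been seen within the last ttlMillis. */
    boolean firstOccurrence(K key, long nowMillis) {
        // Purge expired keys first, so memory is bounded by the keys seen within one TTL.
        Iterator<Map.Entry<K, Long>> it = firstSeenAt.entrySet().iterator();
        while (it.hasNext()) {
            if (nowMillis - it.next().getValue() >= ttlMillis) {
                it.remove();
            } else {
                break; // all remaining entries are newer
            }
        }
        if (firstSeenAt.containsKey(key)) {
            return false; // duplicate within the TTL
        }
        firstSeenAt.put(key, nowMillis);
        return true;
    }

    /** Number of keys currently remembered. */
    int size() {
        return firstSeenAt.size();
    }
}
```

The trade-off is the one described above: a duplicate that arrives after its key has been purged is emitted again, so the TTL should be longer than the largest gap you expect between duplicates.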

If you're aware of this problem but want to try it anyway, you can do it by applying a stateful flatMap operation after the keyBy call. The stateful mapper uses Flink's state abstraction to store whether it has already seen an element with this key or not. That way, you will also benefit from Flink's fault tolerance mechanism, because your state will be automatically checkpointed.
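Stripped of the Flink machinery, the logic is just "one boolean per composite key". The following plain-Java sketch illustrates that on the rows from the question; FirstOccurrenceSketch and firstPerKey are hypothetical names, and the HashSet merely stands in for Flink's keyed state (it is neither distributed nor checkpointed).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical plain-Java stand-in for keyBy(0, 1) plus a per-key "seen" flag. */
class FirstOccurrenceSketch {

    /** Keeps only the first row for each (contract_num, event_dt) pair. */
    static List<String[]> firstPerKey(List<String[]> rows) {
        Set<List<String>> seen = new HashSet<>(); // plays the role of the keyed state
        List<String[]> out = new ArrayList<>();
        for (String[] row : rows) {
            // Composite key built from fields 0 and 1, like keyBy(0, 1).
            List<String> key = Arrays.asList(row[0], row[1]);
            if (seen.add(key)) { // add() returns false for a key that is already present
                out.add(row);
            }
        }
        return out;
    }
}
```

Feeding it the four input rows from the question keeps the rows with attr X, Z and C and drops the row with attr Y.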

A Flink program doing your job could look like this:

public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    DataStream<Tuple3<String, Date, String>> input = env.fromElements(Tuple3.of("foo", new Date(1000), "bar"), Tuple3.of("foo", new Date(1000), "foobar"));

    input.keyBy(0, 1).flatMap(new DuplicateFilter()).print();

    env.execute("Test");
}

where the implementation of DuplicateFilter depends on the version of Flink.

Version >= 1.0 implementation

public static class DuplicateFilter extends RichFlatMapFunction<Tuple3<String, Date, String>, Tuple3<String, Date, String>> {

    static final ValueStateDescriptor<Boolean> descriptor = new ValueStateDescriptor<>("seen", Boolean.class, false);
    private ValueState<Boolean> operatorState;

    @Override
    public void open(Configuration configuration) {
        operatorState = this.getRuntimeContext().getState(descriptor);
    }

    @Override
    public void flatMap(Tuple3<String, Date, String> value, Collector<Tuple3<String, Date, String>> out) throws Exception {
        if (!operatorState.value()) {
            // we haven't seen the element yet
            out.collect(value);
            // set operator state to true so that we don't emit elements with this key again
            operatorState.update(true);
        }
    }
}

Version 0.10 implementation

public static class DuplicateFilter extends RichFlatMapFunction<Tuple3<String, Date, String>, Tuple3<String, Date, String>> {

    private OperatorState<Boolean> operatorState;

    @Override
    public void open(Configuration configuration) {
        operatorState = this.getRuntimeContext().getKeyValueState("seen", Boolean.class, false);
    }

    @Override
    public void flatMap(Tuple3<String, Date, String> value, Collector<Tuple3<String, Date, String>> out) throws Exception {
        if (!operatorState.value()) {
            // we haven't seen the element yet
            out.collect(value);
            operatorState.update(true);
        }
    }
}

Update: Using a tumbling time window

input.keyBy(0, 1).timeWindow(Time.seconds(1)).apply(new WindowFunction<Iterable<Tuple3<String,Date,String>>, Tuple3<String, Date, String>, Tuple, TimeWindow>() {
    @Override
    public void apply(Tuple tuple, TimeWindow window, Iterable<Tuple3<String, Date, String>> input, Collector<Tuple3<String, Date, String>> out) throws Exception {
        out.collect(input.iterator().next());
    }
})
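One thing to keep in mind with the windowed variant: a result is only emitted when the window closes, and two duplicates are only collapsed if they land in the same window. The sketch below is not Flink API; it assumes windows aligned to multiples of their size and non-negative millisecond timestamps, and the names TumblingWindowSketch, windowStart and sameWindow are made up for illustration.

```java
/** Hypothetical illustration, not Flink API: which tumbling window a timestamp lands in. */
class TumblingWindowSketch {

    /** Start of the tumbling window of the given size that contains the timestamp. */
    static long windowStart(long timestampMillis, long sizeMillis) {
        return timestampMillis - (timestampMillis % sizeMillis);
    }

    /** Two elements with the same key are only deduplicated if they share a window. */
    static boolean sameWindow(long t1, long t2, long sizeMillis) {
        return windowStart(t1, sizeMillis) == windowStart(t2, sizeMillis);
    }
}
```

With a 1000 ms window, elements at 100 ms and 900 ms share a window, while elements at 999 ms and 1001 ms do not, even though they are only 2 ms apart. In that second case both duplicates would be emitted.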

Answer 2

Here's another way to do this that I happen to have just written. It has the disadvantage that it requires a bit more custom code, since it doesn't use the built-in Flink windowing functions, but it doesn't have the latency penalty that Till mentioned. Full example on GitHub.

package com.dataartisans.filters;

import com.google.common.cache.CacheBuilder;
import com.google.common.cache.CacheLoader;
import com.google.common.cache.LoadingCache;
import org.apache.flink.api.common.functions.RichFilterFunction;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.checkpoint.CheckpointedAsynchronously;

import java.io.Serializable;
import java.util.HashSet;
import java.util.concurrent.TimeUnit;


/**
  * This class filters duplicates that occur within a configurable time of each other in a data stream.
  */
public class DedupeFilterFunction<T, K extends Serializable> extends RichFilterFunction<T> implements CheckpointedAsynchronously<HashSet<K>> {

  private LoadingCache<K, Boolean> dedupeCache;
  private final KeySelector<T, K> keySelector;
  private final long cacheExpirationTimeMs;

  /**
    * @param cacheExpirationTimeMs The expiration time for elements in the cache
    */
  public DedupeFilterFunction(KeySelector<T, K> keySelector, long cacheExpirationTimeMs){
    this.keySelector = keySelector;
    this.cacheExpirationTimeMs = cacheExpirationTimeMs;
  }

  @Override
  public void open(Configuration parameters) throws Exception {
    createDedupeCache();
  }


  @Override
  public boolean filter(T value) throws Exception {
    K key = keySelector.getKey(value);
    boolean seen = dedupeCache.get(key);
    if (!seen) {
      dedupeCache.put(key, true);
      return true;
    } else {
      return false;
    }
  }

  @Override
  public HashSet<K> snapshotState(long checkpointId, long checkpointTimestamp) throws Exception {
    return new HashSet<>(dedupeCache.asMap().keySet());
  }

  @Override
  public void restoreState(HashSet<K> state) throws Exception {
    createDedupeCache();
    for (K key : state) {
      dedupeCache.put(key, true);
    }
  }

  private void createDedupeCache() {
    dedupeCache = CacheBuilder.newBuilder()
      .expireAfterWrite(cacheExpirationTimeMs, TimeUnit.MILLISECONDS)
      .build(new CacheLoader<K, Boolean>() {
        @Override
        public Boolean load(K k) throws Exception {
          return false;
        }
      });
  }
}

Answer 1: 12 votes, accepted, 2016-02-24 10:57:25
Answer 2: 2 votes, 2016-02-25 20:24:12