
apache flink 0.10 how to get the first occurrence of a composite key from an unbounded input dataStream?

I am a newbie with Apache Flink. I have an unbounded data stream in my input (fed into Flink 0.10 via Kafka).

I want to get the 1st occurrence of each primary key (the primary key is the contract_num and the event_dt).
These "duplicates" occur nearly immediately after each other. The source system cannot filter this for me, so Flink has to do it.

Here is my input data:

contract_num, event_dt, attr 
A1, 2016-02-24 10:25:08, X
A1, 2016-02-24 10:25:08, Y
A1, 2016-02-24 10:25:09, Z
A2, 2016-02-24 10:25:10, C

Here is the output data I want:

A1, 2016-02-24 10:25:08, X
A1, 2016-02-24 10:25:09, Z
A2, 2016-02-24 10:25:10, C

Note the 2nd row has been removed, as the key combination of A1 and '2016-02-24 10:25:08' already occurred in the 1st row.

How can I do this with Flink 0.10?

I was thinking about using keyBy(0,1), but after that I don't know what to do!

(I used joda-time and org.flinkspector to set up these tests.)

@Test
public void test() {
    DateTime threeSecondsAgo = (new DateTime()).minusSeconds(3);
    DateTime twoSecondsAgo = (new DateTime()).minusSeconds(2);
    DateTime oneSecondsAgo = (new DateTime()).minusSeconds(1);

    DataStream<Tuple3<String, Date, String>> testStream =
            createTimedTestStreamWith(
                    Tuple3.of("A1", threeSecondsAgo.toDate(), "X"))
            .emit(Tuple3.of("A1", threeSecondsAgo.toDate(), "Y"), after(0, TimeUnit.NANOSECONDS))
            .emit(Tuple3.of("A1", twoSecondsAgo.toDate(), "Z"), after(0, TimeUnit.NANOSECONDS))
            .emit(Tuple3.of("A2", oneSecondsAgo.toDate(), "C"), after(0, TimeUnit.NANOSECONDS))
            .close();
    
    testStream.keyBy(0,1);
}

Filtering duplicates over an infinite stream will eventually fail if your key space is larger than your available storage space. The reason is that you have to store the already seen keys somewhere to filter out the duplicates. Thus, it would be good to define a time window after which you can purge the current set of seen keys.
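(As an aside, and not available in 0.10: much newer Flink releases, 1.6 and later, can do this purging automatically via state TTL. A minimal sketch of the idea, assuming Flink 1.6+:)

// Assumes Flink 1.6+; StateTtlConfig does not exist in 0.10 / 1.0.
// Time here is org.apache.flink.api.common.time.Time.
// Each "seen" flag expires one minute after it was written, so the
// seen-key state stays bounded.
StateTtlConfig ttlConfig = StateTtlConfig
        .newBuilder(Time.minutes(1))
        .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)
        .build();

ValueStateDescriptor<Boolean> seenDescriptor =
        new ValueStateDescriptor<>("seen", Boolean.class);
seenDescriptor.enableTimeToLive(ttlConfig);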

If you're aware of this problem but want to try it anyway, you can do it by applying a stateful flatMap operation after the keyBy call. The stateful mapper uses Flink's state abstraction to store whether it has already seen an element with this key or not. That way, you will also benefit from Flink's fault-tolerance mechanism, because your state will be automatically checkpointed.

A Flink program doing your job could look like this:

public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    // two elements with the same (f0, f1) key; the second one is a duplicate
    DataStream<Tuple3<String, Date, String>> input = env.fromElements(
            Tuple3.of("foo", new Date(1000), "bar"),
            Tuple3.of("foo", new Date(1000), "foobar"));

    // key by the composite key (fields 0 and 1), then filter statefully per key
    input.keyBy(0, 1).flatMap(new DuplicateFilter()).print();

    env.execute("Test");
}
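Run as-is, this should print only the first tuple, since the second element carries the same (f0, f1) key and is dropped by the filter.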

where the implementation of DuplicateFilter depends on the version of Flink.

Version >= 1.0 implementation

public static class DuplicateFilter extends RichFlatMapFunction<Tuple3<String, Date, String>, Tuple3<String, Date, String>> {

    static final ValueStateDescriptor<Boolean> descriptor = new ValueStateDescriptor<>("seen", Boolean.class, false);
    private ValueState<Boolean> operatorState;

    @Override
    public void open(Configuration configuration) {
        operatorState = this.getRuntimeContext().getState(descriptor);
    }

    @Override
    public void flatMap(Tuple3<String, Date, String> value, Collector<Tuple3<String, Date, String>> out) throws Exception {
        if (!operatorState.value()) {
            // we haven't seen the element yet
            out.collect(value);
            // set operator state to true so that we don't emit elements with this key again
            operatorState.update(true);
        }
    }
}

Version 0.10 implementation

public static class DuplicateFilter extends RichFlatMapFunction<Tuple3<String, Date, String>, Tuple3<String, Date, String>> {

    private OperatorState<Boolean> operatorState;

    @Override
    public void open(Configuration configuration) {
        operatorState = this.getRuntimeContext().getKeyValueState("seen", Boolean.class, false);
    }

    @Override
    public void flatMap(Tuple3<String, Date, String> value, Collector<Tuple3<String, Date, String>> out) throws Exception {
        if (!operatorState.value()) {
            // we haven't seen the element yet
            out.collect(value);
            operatorState.update(true);
        }
    }
}

Update: Using a tumbling time window

If deduplicating within a bounded time span is good enough, a tumbling time window that forwards only the first element per key keeps the seen-key state bounded, at the cost of delaying each result until its window fires:

input.keyBy(0, 1)
    .timeWindow(Time.seconds(1))
    .apply(new WindowFunction<Iterable<Tuple3<String, Date, String>>, Tuple3<String, Date, String>, Tuple, TimeWindow>() {
        @Override
        public void apply(Tuple tuple, TimeWindow window, Iterable<Tuple3<String, Date, String>> input, Collector<Tuple3<String, Date, String>> out) throws Exception {
            // forward only the first element of each key's window; later duplicates are dropped
            out.collect(input.iterator().next());
        }
    });

Here's another way to do this that I happen to have just written. It has the disadvantage of being a bit more custom code, since it doesn't use the built-in Flink windowing functions, but it doesn't have the latency penalty that Till mentioned. Full example on GitHub.

package com.dataartisans.filters;

import com.google.common.cache.CacheBuilder;
import com.google.common.cache.CacheLoader;
import com.google.common.cache.LoadingCache;
import org.apache.flink.api.common.functions.RichFilterFunction;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.checkpoint.CheckpointedAsynchronously;

import java.io.Serializable;
import java.util.HashSet;
import java.util.concurrent.TimeUnit;


/**
  * This class filters duplicates that occur within a configurable time of each other in a data stream.
  */
public class DedupeFilterFunction<T, K extends Serializable> extends RichFilterFunction<T> implements CheckpointedAsynchronously<HashSet<K>> {

  private LoadingCache<K, Boolean> dedupeCache;
  private final KeySelector<T, K> keySelector;
  private final long cacheExpirationTimeMs;

  /**
    * @param cacheExpirationTimeMs The expiration time for elements in the cache
    */
  public DedupeFilterFunction(KeySelector<T, K> keySelector, long cacheExpirationTimeMs){
    this.keySelector = keySelector;
    this.cacheExpirationTimeMs = cacheExpirationTimeMs;
  }

  @Override
  public void open(Configuration parameters) throws Exception {
    createDedupeCache();
  }


  @Override
  public boolean filter(T value) throws Exception {
    K key = keySelector.getKey(value);
    boolean seen = dedupeCache.get(key);
    if (!seen) {
      dedupeCache.put(key, true);
      return true;
    } else {
      return false;
    }
  }

  @Override
  public HashSet<K> snapshotState(long checkpointId, long checkpointTimestamp) throws Exception {
    return new HashSet<>(dedupeCache.asMap().keySet());
  }

  @Override
  public void restoreState(HashSet<K> state) throws Exception {
    createDedupeCache();
    for (K key : state) {
      dedupeCache.put(key, true);
    }
  }

  private void createDedupeCache() {
    dedupeCache = CacheBuilder.newBuilder()
      .expireAfterWrite(cacheExpirationTimeMs, TimeUnit.MILLISECONDS)
      .build(new CacheLoader<K, Boolean>() {
        @Override
        public Boolean load(K k) throws Exception {
          return false;
        }
      });
  }
}
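A minimal usage sketch (hypothetical wiring, not taken from the linked example), keying on the question's composite key (contract_num, event_dt) and expiring seen keys after 10 seconds:

DataStream<Tuple3<String, Date, String>> deduped = input.filter(
    new DedupeFilterFunction<Tuple3<String, Date, String>, Tuple2<String, Date>>(
        new KeySelector<Tuple3<String, Date, String>, Tuple2<String, Date>>() {
            @Override
            public Tuple2<String, Date> getKey(Tuple3<String, Date, String> value) {
                // hypothetical composite key: (contract_num, event_dt)
                return Tuple2.of(value.f0, value.f1);
            }
        },
        TimeUnit.SECONDS.toMillis(10))); // expire seen keys after 10 seconds

Note that the cache lives in each parallel subtask, so with parallelism greater than 1 you would keyBy the same fields first, so that duplicates of a key always land on the same filter instance.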
