Apache Flink 0.10: how to get the first occurrence of a composite key from an unbounded input DataStream?
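I am new to Apache Flink. My input is an unbounded data stream (fed into Flink 0.10 via Kafka).
I want to get the first occurrence of each primary key (the primary key is contract_num and event_dt).
These "duplicates" occur almost immediately after one another. The source system cannot filter them out for me, so Flink has to do it.
Here is my input data: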
contract_num, event_dt, attr
A1, 2016-02-24 10:25:08, X
A1, 2016-02-24 10:25:08, Y
A1, 2016-02-24 10:25:09, Z
A2, 2016-02-24 10:25:10, C
Here is the output data I want:
A1, 2016-02-24 10:25:08, X
A1, 2016-02-24 10:25:09, Z
A2, 2016-02-24 10:25:10, C
Note that the second row has been removed, because the composite key of A1 and "2016-02-24 10:25:08" already occurred in the first row.
How can I do this with Flink 0.10?
I was thinking about using keyBy(0,1), but after that I don't know what to do!
(I am using joda-time and org.flinkspector to set up these tests.)
    @Test
    public void test() {
        DateTime threeSecondsAgo = (new DateTime()).minusSeconds(3);
        DateTime twoSecondsAgo = (new DateTime()).minusSeconds(2);
        DateTime oneSecondsAgo = (new DateTime()).minusSeconds(1);
        DataStream<Tuple3<String, Date, String>> testStream =
            createTimedTestStreamWith(
                Tuple3.of("A1", threeSecondsAgo.toDate(), "X"))
            .emit(Tuple3.of("A1", threeSecondsAgo.toDate(), "Y"), after(0, TimeUnit.NANOSECONDS))
            .emit(Tuple3.of("A1", twoSecondsAgo.toDate(), "Z"), after(0, TimeUnit.NANOSECONDS))
            .emit(Tuple3.of("A2", oneSecondsAgo.toDate(), "C"), after(0, TimeUnit.NANOSECONDS))
            .close();
        testStream.keyBy(0,1);
    }
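Filtering duplicates over an infinite stream will eventually fail if your key space is larger than your available storage. The reason is that you have to store the keys you have already seen somewhere in order to filter out duplicates. It is therefore a good idea to define a time window after which you can purge the current set of seen keys.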
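To make the idea of purging seen keys after a fixed time concrete, here is a minimal plain-Java sketch without any Flink dependency. The class name ExpiringSeenSet, the ttlMs parameter, and the caller-supplied clock value nowMs are all assumptions made for illustration; the linear purge on every call is kept for brevity and would be too slow for a large key set.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

/** Plain-Java illustration (no Flink): remembers each key for ttlMs milliseconds only. */
class ExpiringSeenSet {
    private final long ttlMs;
    // key -> timestamp (ms) at which the key was first seen
    private final Map<String, Long> firstSeen = new HashMap<>();

    ExpiringSeenSet(long ttlMs) {
        this.ttlMs = ttlMs;
    }

    /** Returns true if the key has not been seen within the last ttlMs milliseconds. */
    boolean firstOccurrence(String key, long nowMs) {
        // Drop expired keys so the map only holds keys seen during the last ttlMs.
        Iterator<Map.Entry<String, Long>> it = firstSeen.entrySet().iterator();
        while (it.hasNext()) {
            if (nowMs - it.next().getValue() >= ttlMs) {
                it.remove();
            }
        }
        // putIfAbsent returns null only when the key was not present yet.
        return firstSeen.putIfAbsent(key, nowMs) == null;
    }

    int size() {
        return firstSeen.size();
    }

    public static void main(String[] args) {
        ExpiringSeenSet seen = new ExpiringSeenSet(1000);
        System.out.println(seen.firstOccurrence("A1|2016-02-24 10:25:08", 0));    // true
        System.out.println(seen.firstOccurrence("A1|2016-02-24 10:25:08", 10));   // false
        System.out.println(seen.firstOccurrence("A1|2016-02-24 10:25:08", 2000)); // true, the entry expired
    }
}
```

The trade-off is visible in the last call: once a key has expired, a late duplicate passes through again, so the expiration time has to be longer than the largest gap you expect between duplicates.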
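If you are aware of this problem but want to try it anyway, you can do so by applying a stateful flatMap operation after the keyBy call. The stateful mapper uses Flink's state abstraction to store whether it has already seen an element with this key. That way, you also benefit from Flink's fault-tolerance mechanism, because your state is checkpointed automatically.
A Flink program that does the job could look like this: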
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStream<Tuple3<String, Date, String>> input = env.fromElements(Tuple3.of("foo", new Date(1000), "bar"), Tuple3.of("foo", new Date(1000), "foobar"));
        input.keyBy(0, 1).flatMap(new DuplicateFilter()).print();
        env.execute("Test");
    }
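where the implementation of DuplicateFilter depends on the Flink version. The first variant below uses the ValueState API (Flink 1.0 and later); the second uses the OperatorState API of Flink 0.10.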
    public static class DuplicateFilter extends RichFlatMapFunction<Tuple3<String, Date, String>, Tuple3<String, Date, String>> {
        static final ValueStateDescriptor<Boolean> descriptor = new ValueStateDescriptor<>("seen", Boolean.class, false);
        private ValueState<Boolean> operatorState;

        @Override
        public void open(Configuration configuration) {
            operatorState = this.getRuntimeContext().getState(descriptor);
        }

        @Override
        public void flatMap(Tuple3<String, Date, String> value, Collector<Tuple3<String, Date, String>> out) throws Exception {
            if (!operatorState.value()) {
                // we haven't seen the element yet
                out.collect(value);
                // set operator state to true so that we don't emit elements with this key again
                operatorState.update(true);
            }
        }
    }
    public static class DuplicateFilter extends RichFlatMapFunction<Tuple3<String, Date, String>, Tuple3<String, Date, String>> {
        private OperatorState<Boolean> operatorState;

        @Override
        public void open(Configuration configuration) {
            operatorState = this.getRuntimeContext().getKeyValueState("seen", Boolean.class, false);
        }

        @Override
        public void flatMap(Tuple3<String, Date, String> value, Collector<Tuple3<String, Date, String>> out) throws Exception {
            if (!operatorState.value()) {
                // we haven't seen the element yet
                out.collect(value);
                // set operator state to true so that we don't emit elements with this key again
                operatorState.update(true);
            }
        }
    }
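To see what the per-key "seen" flag kept by DuplicateFilter does to the sample input from the question, here is a plain-Java simulation without Flink. It is only a sketch: the class FirstOccurrenceSimulation and the "|" separator used to build the composite key are assumptions for illustration, and a HashSet stands in for Flink's keyed state (so it has neither checkpointing nor any bound on memory).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class FirstOccurrenceSimulation {
    /** Keeps the first row for each (contract_num, event_dt) pair, in input order. */
    static List<String[]> firstPerKey(List<String[]> rows) {
        Set<String> seen = new HashSet<>();      // stands in for the per-key "seen" flag
        List<String[]> out = new ArrayList<>();
        for (String[] row : rows) {
            // Composite key; assumes the fields themselves never contain "|".
            String compositeKey = row[0] + "|" + row[1];
            if (seen.add(compositeKey)) {        // add() returns false for a repeated key
                out.add(row);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(
            new String[] {"A1", "2016-02-24 10:25:08", "X"},
            new String[] {"A1", "2016-02-24 10:25:08", "Y"},
            new String[] {"A1", "2016-02-24 10:25:09", "Z"},
            new String[] {"A2", "2016-02-24 10:25:10", "C"});
        for (String[] row : firstPerKey(rows)) {
            System.out.println(String.join(", ", row));
        }
        // prints:
        // A1, 2016-02-24 10:25:08, X
        // A1, 2016-02-24 10:25:09, Z
        // A2, 2016-02-24 10:25:10, C
    }
}
```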
Alternatively, using a tumbling time window as suggested above:
    input.keyBy(0, 1).timeWindow(Time.seconds(1)).apply(new WindowFunction<Iterable<Tuple3<String,Date,String>>, Tuple3<String, Date, String>, Tuple, TimeWindow>() {
        @Override
        public void apply(Tuple tuple, TimeWindow window, Iterable<Tuple3<String, Date, String>> input, Collector<Tuple3<String, Date, String>> out) throws Exception {
            // emit only the first element of each key within this window
            out.collect(input.iterator().next());
        }
    });
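The semantics of "first element per key per tumbling window" can be sketched in plain Java, without Flink. The class TumblingWindowFirst, its Event type, and the millisecond timestamps are assumptions made for illustration; timestamps are assumed to be non-negative, and the batch-style list replaces what is really an ongoing stream.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class TumblingWindowFirst {
    /** Minimal event: a key, a timestamp in milliseconds, and a payload. */
    static final class Event {
        final String key;
        final long timestampMs;
        final String payload;

        Event(String key, long timestampMs, String payload) {
            this.key = key;
            this.timestampMs = timestampMs;
            this.payload = payload;
        }
    }

    /** Returns the first event per (key, window); windows are [n * size, (n + 1) * size). */
    static List<Event> firstPerWindow(List<Event> events, long windowSizeMs) {
        // LinkedHashMap keeps the output in order of first appearance.
        Map<String, Event> first = new LinkedHashMap<>();
        for (Event e : events) {
            long windowStart = e.timestampMs - (e.timestampMs % windowSizeMs);
            first.putIfAbsent(e.key + "@" + windowStart, e);
        }
        return new ArrayList<>(first.values());
    }

    public static void main(String[] args) {
        List<Event> events = Arrays.asList(
            new Event("A", 100, "X"),
            new Event("A", 900, "Y"),    // same key, same window as X: dropped
            new Event("A", 1100, "Z"));  // same key, next window: kept
        for (Event e : firstPerWindow(events, 1000)) {
            System.out.println(e.key + " " + e.payload);
        }
        // prints:
        // A X
        // A Z
    }
}
```

Two properties of the window approach show up here. Duplicates that fall on different sides of a window boundary both pass, and a real window operator can only emit its result once the window has closed, which adds latency of up to one window length.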
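Here is another approach that I just wrote. Its downside is that it involves a bit more custom code, since it does not use the built-in Flink window functions, but it does not have the latency penalty that Till mentioned. Full example on GitHub.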
    package com.dataartisans.filters;

    import com.google.common.cache.CacheBuilder;
    import com.google.common.cache.CacheLoader;
    import com.google.common.cache.LoadingCache;
    import org.apache.flink.api.common.functions.RichFilterFunction;
    import org.apache.flink.api.java.functions.KeySelector;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.streaming.api.checkpoint.CheckpointedAsynchronously;
    import java.io.Serializable;
    import java.util.HashSet;
    import java.util.concurrent.TimeUnit;

    /**
     * This class filters duplicates that occur within a configurable time of each other in a data stream.
     */
    public class DedupeFilterFunction<T, K extends Serializable> extends RichFilterFunction<T> implements CheckpointedAsynchronously<HashSet<K>> {
        private LoadingCache<K, Boolean> dedupeCache;
        private final KeySelector<T, K> keySelector;
        private final long cacheExpirationTimeMs;

        /**
         * @param keySelector Extracts the key used to detect duplicates
         * @param cacheExpirationTimeMs The expiration time for elements in the cache
         */
        public DedupeFilterFunction(KeySelector<T, K> keySelector, long cacheExpirationTimeMs){
            this.keySelector = keySelector;
            this.cacheExpirationTimeMs = cacheExpirationTimeMs;
        }

        @Override
        public void open(Configuration parameters) throws Exception {
            // restoreState() runs before open() on recovery, so keep a restored cache
            if (dedupeCache == null) {
                createDedupeCache();
            }
        }

        @Override
        public boolean filter(T value) throws Exception {
            K key = keySelector.getKey(value);
            // the loader returns false for keys that are not in the cache yet
            boolean seen = dedupeCache.get(key);
            if (!seen) {
                dedupeCache.put(key, true);
                return true;
            } else {
                return false;
            }
        }

        @Override
        public HashSet<K> snapshotState(long checkpointId, long checkpointTimestamp) throws Exception {
            return new HashSet<>(dedupeCache.asMap().keySet());
        }

        @Override
        public void restoreState(HashSet<K> state) throws Exception {
            createDedupeCache();
            for (K key : state) {
                dedupeCache.put(key, true);
            }
        }

        private void createDedupeCache() {
            dedupeCache = CacheBuilder.newBuilder()
                .expireAfterWrite(cacheExpirationTimeMs, TimeUnit.MILLISECONDS)
                .build(new CacheLoader<K, Boolean>() {
                    @Override
                    public Boolean load(K k) throws Exception {
                        return false;
                    }
                });
        }
    }