
Is there an efficient way to outer join several (more than 2) Kafka topics?

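I want to outer join several (typically 2-10) Kafka topics by key, ideally using the Streams API. All topics will have the same key and partitioning. One way to do this join is to create a KStream for each topic and chain calls to KStream.outerJoin: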

stream1
    .outerJoin(stream2, ...)
    .outerJoin(stream3, ...)
    .outerJoin(stream4, ...)

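However, the documentation for KStream.outerJoin suggests that each call to outerJoin materializes its two input streams, so the example above would materialize not only streams 1 to 4 but also stream1.outerJoin(stream2, ...) and stream1.outerJoin(stream2, ...).outerJoin(stream3, ...). Compared to joining the 4 streams directly, that means a lot of unnecessary serialization, deserialization, and I/O.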

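Another problem with the above approach is that the JoinWindow would not be consistent across all 4 input streams: one JoinWindow is used to join streams 1 and 2, but a separate join window is then used to join the resulting stream with stream 3, and so on. For example, suppose I specify a join window of 10 seconds for each join, and entries with a certain key appear in stream 1 at 0 seconds, stream 2 at 6 seconds, stream 3 at 12 seconds, and stream 4 at 18 seconds. The joined item would only be output after 18 seconds, which is an excessively high latency for a 10-second window. The result also depends on the order of the joins, which seems unnatural.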
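The timing in this example can be checked with a small model. The sketch below is plain Java, not Kafka Streams code. It assumes that each intermediate join result carries the larger of its two input timestamps, and it only asks whether a full match across all streams would be produced. The class and method names are made up for illustration.

```java
/** Simplified model of windowed joins; all names here are hypothetical. */
class JoinWindowDemo {

    /**
     * Chained pairwise joins: each arrival is compared with the previous
     * intermediate result, whose timestamp is assumed to be the max of
     * the timestamps joined so far.
     */
    static boolean chainedJoinMatches(final long[] arrivals, final long window) {
        long resultTs = arrivals[0];
        for (int i = 1; i < arrivals.length; i++) {
            if (Math.abs(arrivals[i] - resultTs) > window)
                return false;
            resultTs = Math.max(resultTs, arrivals[i]);
        }
        return true;
    }

    /** One window over all inputs: first and last arrival must fit in it. */
    static boolean singleWindowMatches(final long[] arrivals, final long window) {
        long min = Long.MAX_VALUE;
        long max = Long.MIN_VALUE;
        for (final long t : arrivals) {
            min = Math.min(min, t);
            max = Math.max(max, t);
        }
        return max - min <= window;
    }

    public static void main(final String[] args) {
        // Arrival times in seconds, listed in join order (streams 1 to 4).
        final long[] arrivals = {0, 6, 12, 18};
        System.out.println(chainedJoinMatches(arrivals, 10));  // true
        System.out.println(singleWindowMatches(arrivals, 10)); // false

        // Same timestamps, different join order, different outcome.
        System.out.println(chainedJoinMatches(new long[] {0, 12, 6}, 10)); // false
        System.out.println(chainedJoinMatches(new long[] {0, 6, 12}, 10)); // true
    }
}
```

Under these assumptions the chained join accepts a spread of 18 seconds even though every window is 10 seconds, and reordering the joins changes whether a full match is produced.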

Is there a better approach to multi-way joins using Kafka?

I don't know of a better approach in Kafka Streams at the moment, but one is in the making:

https://cwiki.apache.org/confluence/display/KAFKA/KIP-150+-+Kafka-Streams+Cogroup

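Ultimately I decided to create a custom lightweight joiner that avoids materialization and strictly honors the expiry time. It should be O(1) on average. It fits better with the Consumer API than with the Streams API: for each consumer, repeatedly poll and update the joiner with any received data; if the joiner returns a complete set of values, forward it on. Here is the code: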

import java.util.ArrayList;
import java.util.Collection;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

/**
 * Inner joins multiple streams of data by key into one stream. It is assumed
 * that a key will appear in a stream exactly once. The values associated with
 * each key are collected and if all values are received within a certain
 * maximum wait time, the joiner returns all values corresponding to that key.
 * If not all values are received in time, the joiner never returns any values
 * corresponding to that key.
 * <p>
 * This class is not thread safe: all calls to
 * {@link #update(Object, Object, long)} must be synchronized.
 * @param <K> The type of key.
 * @param <V> The type of value.
 */
class StreamInnerJoiner<K, V> {

    private final Map<K, Vals<V>> idToVals = new LinkedHashMap<>();
    private final int joinCount;
    private final long maxWait;

    /**
     * Creates a stream inner joiner.
     * @param joinCount The number of streams being joined.
     * @param maxWait The maximum amount of time after an item has been seen in
     * one stream to wait for it to be seen in the remaining streams.
     */
    StreamInnerJoiner(final int joinCount, final long maxWait) {
        this.joinCount = joinCount;
        this.maxWait = maxWait;
    }

    private static class Vals<A> {
        final long firstSeen;
        final Collection<A> vals = new ArrayList<>();
        private Vals(final long firstSeen) {
            this.firstSeen = firstSeen;
        }
    }

    /**
     * Updates this joiner with a value corresponding to a key.
     * @param key The key.
     * @param val The value.
     * @param now The current time.
     * @return If all values for the specified key have been received, the
     * complete collection of values for that key; otherwise
     * {@link Optional#empty()}.
     */
    Optional<Collection<V>> update(final K key, final V val, final long now) {
        expireOld(now - maxWait);
        final Vals<V> curVals = getOrCreate(key, now);
        curVals.vals.add(val);
        return expireAndGetIffFull(key, curVals);
    }

    private Vals<V> getOrCreate(final K key, final long now) {
        final Vals<V> existingVals = idToVals.get(key);
        if (existingVals != null)
            return existingVals;
        else {
            /*
            Note: we assume that the item with the specified ID has not already
            been seen and timed out, and therefore that its first seen time is
            now. If the item has in fact already timed out, it is doomed and
            will time out again with no ill effect.
             */
            final Vals<V> curVals = new Vals<>(now);
            idToVals.put(key, curVals);
            return curVals;
        }
    }

    private void expireOld(final long expireBefore) {
        final Iterator<Vals<V>> i = idToVals.values().iterator();
        while (i.hasNext() && i.next().firstSeen < expireBefore)
            i.remove();
    }

    private Optional<Collection<V>> expireAndGetIffFull(final K key, final Vals<V> vals) {
        if (vals.vals.size() == joinCount) {
            // as all expired entries were already removed, this entry is valid
            idToVals.remove(key);
            return Optional.of(vals.vals);
        } else
            return Optional.empty();
    }
}
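The poll-and-update loop described before the code can be sketched roughly as follows. This is only an illustration: JoinDriver, Update, and processBatch are hypothetical names, and an in-memory batch of key/value entries stands in for whatever one poll of a consumer returns, so no Kafka client classes are used here.

```java
import java.util.Collection;
import java.util.Map;
import java.util.Optional;
import java.util.function.Consumer;

/** Hypothetical driver that feeds polled records into a joiner. */
class JoinDriver {

    /** Same shape as StreamInnerJoiner.update, so joiner::update fits. */
    @FunctionalInterface
    interface Update<K, V> {
        Optional<Collection<V>> update(K key, V val, long now);
    }

    /**
     * Processes one batch of records (a stand-in for the result of one
     * poll). Because the joiner is not thread safe, this must be called
     * from a single thread or under a lock shared by all consumers.
     */
    static <K, V> void processBatch(
            final Iterable<Map.Entry<K, V>> batch,
            final long now,
            final Update<K, V> joiner,
            final Consumer<Collection<V>> forward) {
        for (final Map.Entry<K, V> record : batch) {
            // Forward only when the joiner reports a complete set of values.
            joiner.update(record.getKey(), record.getValue(), now)
                  .ifPresent(forward);
        }
    }
}
```

With the class above, a call might look like JoinDriver.processBatch(batch, System.currentTimeMillis(), joiner::update, System.out::println), where joiner is a StreamInnerJoiner created with the number of topics and the maximum wait in milliseconds.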

If you merge all the streams, you will get what you want. See the tutorial for how to do it.

Input streams are combined using the merge function, which creates a new stream that represents all of the events of its inputs.
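Note that merging by itself only interleaves records; to end up with one combined row per key, the merged stream would presumably still have to be grouped by key and aggregated. The sketch below models that idea in plain Java rather than with the Kafka Streams API. The Event class, the topic tags, and both method names are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java model of "merge, then aggregate per key"; names are hypothetical. */
class MergeThenAggregate {

    /** One event, tagged with the name of the topic it came from. */
    static final class Event {
        final String topic;
        final String key;
        final String value;

        Event(final String topic, final String key, final String value) {
            this.topic = topic;
            this.key = key;
            this.value = value;
        }
    }

    /** Models merge: one stream carrying every event of every input. */
    @SafeVarargs
    static List<Event> merge(final List<Event>... inputs) {
        final List<Event> merged = new ArrayList<>();
        for (final List<Event> input : inputs)
            merged.addAll(input);
        return merged;
    }

    /**
     * Folds the merged events into one row per key, mapping topic to the
     * latest value seen. A topic with no event for a key is simply absent
     * from that row, which gives outer-join-like behaviour.
     */
    static Map<String, Map<String, String>> aggregateByKey(final List<Event> merged) {
        final Map<String, Map<String, String>> rows = new LinkedHashMap<>();
        for (final Event e : merged)
            rows.computeIfAbsent(e.key, k -> new LinkedHashMap<>())
                .put(e.topic, e.value);
        return rows;
    }
}
```

The values have to be tagged with their source topic before merging, otherwise the aggregation step cannot tell which input a value belongs to.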

