
Is there an efficient way to outer join several (more than 2) Kafka topics?

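I want to outer join several (typically 2-10) Kafka topics by key, ideally using the Streams API. All topics will have the same key and partitioning. One way to do this join is to create a KStream for each topic and chain calls to KStream.outerJoin: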

stream1
    .outerJoin(stream2, ...)
    .outerJoin(stream3, ...)
    .outerJoin(stream4, ...)

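However, the documentation of KStream.outerJoin suggests that each call to outerJoin materializes its two input streams, so the example above would materialize not only streams 1 to 4, but also stream1.outerJoin(stream2, ...) and stream1.outerJoin(stream2, ...).outerJoin(stream3, ...). Compared to joining the 4 streams directly, that means a lot of unnecessary serialization, deserialization, and I/O.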

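Another problem with the approach above is that the join window is not consistent across all 4 input streams: one JoinWindows is used to join streams 1 and 2, but then a separate join window is used to join the resulting stream with stream 3, and so on. For example, suppose I specify a 10-second join window for each join, and entries with a certain key appear in stream 1 at 0 seconds, stream 2 at 6 seconds, stream 3 at 12 seconds, and stream 4 at 18 seconds. The joined item is then output after 18 seconds, which is an excessively high delay for a 10-second window. The results also depend on the order of the joins, which seems unnatural.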
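The timing in this example can be checked with a small plain-Java model. This is only a sketch and does not use the Kafka Streams API: the class and method names are hypothetical, and it assumes that a joined record carries the later of its two input timestamps.

```java
import java.util.OptionalLong;

class JoinWindowDemo {

    // Simplified model of chained pairwise windowed joins (assumption: the
    // joined record carries the later of its two input timestamps).
    // Returns the time at which the fully joined record is emitted, or empty
    // if some pairwise join falls outside its window.
    static OptionalLong chainedJoin(final long[] arrivals, final long window) {
        long ts = arrivals[0];
        for (int i = 1; i < arrivals.length; i++) {
            if (Math.abs(arrivals[i] - ts) > window)
                return OptionalLong.empty();
            ts = Math.max(ts, arrivals[i]);
        }
        return OptionalLong.of(ts);
    }

    // A single window shared by all inputs: first and last arrival must be
    // at most `window` apart.
    static boolean withinSingleWindow(final long[] arrivals, final long window) {
        long min = Long.MAX_VALUE;
        long max = Long.MIN_VALUE;
        for (final long t : arrivals) {
            min = Math.min(min, t);
            max = Math.max(max, t);
        }
        return max - min <= window;
    }

    public static void main(final String[] args) {
        final long[] arrivals = {0, 6, 12, 18}; // seconds, streams 1 to 4
        System.out.println(chainedJoin(arrivals, 10));        // OptionalLong[18]
        System.out.println(withinSingleWindow(arrivals, 10)); // false
    }
}
```

Under this model the chained joins all succeed because each consecutive gap is 6 seconds, yet the first and last arrivals are 18 seconds apart, which a single 10-second window over all four streams would reject.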

Is there a better approach to multi-way joins using Kafka?

As far as I know, Kafka Streams currently has no better approach, but one is in the making:

https://cwiki.apache.org/confluence/display/KAFKA/KIP-150+-+Kafka-Streams+Cogroup

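Ultimately, I decided to create a custom lightweight joiner that avoids materialization and strictly honors the expiry time. It should be O(1) on average. It fits better with the Consumer API than with the Streams API: for each consumer, repeatedly poll and update the joiner with any received data; if the joiner returns a complete set of values, forward it. Here is the code: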

import java.util.ArrayList;
import java.util.Collection;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

/**
 * Inner joins multiple streams of data by key into one stream. It is assumed
 * that a key will appear in a stream exactly once. The values associated with
 * each key are collected and if all values are received within a certain
 * maximum wait time, the joiner returns all values corresponding to that key.
 * If not all values are received in time, the joiner never returns any values
 * corresponding to that key.
 * <p>
 * This class is not thread safe: all calls to
 * {@link #update(Object, Object, long)} must be synchronized.
 * @param <K> The type of key.
 * @param <V> The type of value.
 */
class StreamInnerJoiner<K, V> {

    private final Map<K, Vals<V>> idToVals = new LinkedHashMap<>();
    private final int joinCount;
    private final long maxWait;

    /**
     * Creates a stream inner joiner.
     * @param joinCount The number of streams being joined.
     * @param maxWait The maximum amount of time after an item has been seen in
     * one stream to wait for it to be seen in the remaining streams.
     */
    StreamInnerJoiner(final int joinCount, final long maxWait) {
        this.joinCount = joinCount;
        this.maxWait = maxWait;
    }

    private static class Vals<A> {
        final long firstSeen;
        final Collection<A> vals = new ArrayList<>();
        private Vals(final long firstSeen) {
            this.firstSeen = firstSeen;
        }
    }

    /**
     * Updates this joiner with a value corresponding to a key.
     * @param key The key.
     * @param val The value.
     * @param now The current time.
     * @return If all values for the specified key have been received, the
     * complete collection of values for that key; otherwise
     * {@link Optional#empty()}.
     */
    Optional<Collection<V>> update(final K key, final V val, final long now) {
        expireOld(now - maxWait);
        final Vals<V> curVals = getOrCreate(key, now);
        curVals.vals.add(val);
        return expireAndGetIffFull(key, curVals);
    }

    private Vals<V> getOrCreate(final K key, final long now) {
        final Vals<V> existingVals = idToVals.get(key);
        if (existingVals != null)
            return existingVals;
        else {
            /*
            Note: we assume that the item with the specified ID has not already
            been seen and timed out, and therefore that its first seen time is
            now. If the item has in fact already timed out, it is doomed and
            will time out again with no ill effect.
             */
            final Vals<V> curVals = new Vals<>(now);
            idToVals.put(key, curVals);
            return curVals;
        }
    }

    private void expireOld(final long expireBefore) {
        final Iterator<Vals<V>> i = idToVals.values().iterator();
        while (i.hasNext() && i.next().firstSeen < expireBefore)
            i.remove();
    }

    private Optional<Collection<V>> expireAndGetIffFull(final K key, final Vals<V> vals) {
        if (vals.vals.size() == joinCount) {
            // as all expired entries were already removed, this entry is valid
            idToVals.remove(key);
            return Optional.of(vals.vals);
        } else
            return Optional.empty();
    }
}
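The O(1) average cost appears to rest on LinkedHashMap iterating in insertion order: as long as the `now` values passed in never decrease, the oldest entries come first and the expiry scan can stop at the first entry that is still fresh. The following is a minimal standalone sketch of just that technique; the class and method names are hypothetical and are not part of the joiner above.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

class InsertionOrderExpiry {

    // key -> time the key was first seen; iteration follows insertion order
    private final Map<String, Long> firstSeen = new LinkedHashMap<>();

    // Records a key the first time it is seen; later sightings keep both the
    // original timestamp and the original position in the iteration order.
    void record(final String key, final long now) {
        firstSeen.putIfAbsent(key, now);
    }

    // Removes entries first seen before the cutoff and returns how many were
    // removed. Assumes timestamps were recorded in non-decreasing order, so
    // the scan stops at the first entry that has not expired.
    int expireBefore(final long cutoff) {
        int removed = 0;
        final Iterator<Long> it = firstSeen.values().iterator();
        while (it.hasNext() && it.next() < cutoff) {
            it.remove();
            removed++;
        }
        return removed;
    }

    int size() {
        return firstSeen.size();
    }

    public static void main(final String[] args) {
        final InsertionOrderExpiry expiry = new InsertionOrderExpiry();
        expiry.record("a", 0);
        expiry.record("b", 5);
        expiry.record("c", 20);
        System.out.println(expiry.expireBefore(10)); // 2
        System.out.println(expiry.size());           // 1
    }
}
```

Each entry is inserted once and removed at most once, so the total expiry work is proportional to the number of keys seen, not to the map size on every call.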

If you merge all of your streams, you will get what you want. Check the tutorial to see how to do it.

The input streams are combined using the merge function, which creates a new stream that represents all of the events of its inputs.
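Merging alone interleaves events rather than joining them, so a grouping step by key is presumably still needed afterwards. The idea can be sketched in plain Java as follows. This is a model only and not the Kafka Streams API: the `Tagged` type, the class name, and `outerJoin` are hypothetical, and it assumes each record is tagged with the index of the input it came from.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class MergeThenGroup {

    // A record tagged with the index of the input it came from (hypothetical).
    static final class Tagged {
        final int source;
        final String key;
        final String value;

        Tagged(final int source, final String key, final String value) {
            this.source = source;
            this.key = key;
            this.value = value;
        }
    }

    // Visits every event of every input (the "merge") and collects values per
    // key into one slot per input. Slots of inputs that never produced the
    // key stay null, which gives outer-join semantics.
    static Map<String, String[]> outerJoin(final List<List<Tagged>> inputs) {
        final Map<String, String[]> rows = new LinkedHashMap<>();
        for (final List<Tagged> input : inputs)
            for (final Tagged t : input)
                rows.computeIfAbsent(t.key, k -> new String[inputs.size()])[t.source] = t.value;
        return rows;
    }

    public static void main(final String[] args) {
        final List<List<Tagged>> inputs = Arrays.asList(
            Arrays.asList(new Tagged(0, "k1", "a")),
            Arrays.asList(new Tagged(1, "k2", "b")),
            Arrays.asList(new Tagged(2, "k1", "c")));
        outerJoin(inputs).forEach((key, row) ->
            System.out.println(key + " -> " + Arrays.toString(row)));
        // k1 -> [a, null, c]
        // k2 -> [null, b, null]
    }
}
```

This sketch ignores windowing and expiry entirely; a real implementation would also have to decide when a row is complete enough to emit.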
