簡體   English   中英

2個連續的流-流內部連接產生錯誤的結果:流之間的KStream連接在內部真正做了什么?

[英]2 consecutive stream-stream inner joins produce wrong results: what does KStream join between streams really do internally?

問題設定

我有一個 stream 節點和一個 stream 代表圖形的連續更新的邊,我想使用多個串聯連接來構建由節點和邊組成的模式。 假設我想匹配這樣的模式: (node1) --[edge1]--> (node2)
我的想法是將節點的 stream 與邊的 stream 結合起來,以組成類型為子模式的stream --> (node1) --> 然后將生成的 stream 與節點的 stream 再次連接,以組成最終模式(node1) --[edge1]--> (node2) 對特定類型的節點和邊的過濾並不重要。

數據 model

所以我有以 Avro 格式構建的節點、邊緣和模式:

{
  "namespace": "DataModel",
  "type": "record",
  "name": "Node",
  "doc": "Node schema, it contains a nodeID label and properties",
  "fields": [
    {
      "name": "nodeID",
      "type": "long"
    },
    {
      "name": "labels",
      "type": {
        "type": "array",
        "items": "string",
        "avro.java.string": "String"
      }
    },
    {
      "name": "properties",
      "type": {
        "type": "map",
        "values": "string",
        "avro.java.string": "String"
      }
    },
    {
      "name": "timestamp",
      "type": "long"
    }
  ]
}

{
  "namespace": "DataModel",
  "type": "record",
  "name": "Edge",
  "doc": "contains edgeID, a type, a list of properties, a starting node ID and an ending node ID ",
  "fields": [
    {
      "name": "edgeID",
      "type": "long"
    },
    {
      "name": "type",
      "type": "string"
    },
    {
      "name": "properties",
      "type": {
        "type": "map",
        "values": "string",
        "avro.java.string": "String"
      }
    },
    {
      "name": "startID",
      "type": "long"
    },
    {
      "name": "endID",
      "type": "long"
    },
    {
      "name": "timestamp",
      "type": "long"
    }
  ]
}

{
    "namespace": "DataModel",
    "type": "record",
    "name": "Pattern",
    "fields": [
      {
        "name": "first",
        "type": "long"
      },
      {
        "name": "nextJoinID",
        "type": [
          "null",
          "long"
        ],
        "default": null
      },
      {
        "name": "timestamp",
        "type": "long"
      },
      {
        "name": "segments",
        "doc": "It's the ordered list of nodes and edges that compose this sub-pattern from the leftmost node to the rightmost edge or node",
        "type": {
          "type": "array",
          "items": [
            "DataModel.Node",
            "DataModel.Edge"
          ]
        }
      }

然后我有以下兩個ValueJoiners:

第一個用於節點 stream 和邊 stream 的內部連接。
第二個用於超模式 stream 和節點 stream 的內部連接。

public class NodeEdgeJoiner implements ValueJoiner<Node, Edge, Pattern> {

    @Override
    public Pattern apply(Node node, Edge edge) {

        Object[] segments = {node,edge};
        return Pattern.newBuilder()
                .setFirst(node.getNodeID())
                .setNextJoinID(edge.getEndID())
                .setSegments(Arrays.asList(segments))
                .setTimestamp(Math.min(node.getTimestamp(),edge.getTimestamp()))
                .build();

    }
}
public class PatternNodeJoiner implements ValueJoiner<Pattern, Node, Pattern> {

    @Override
    public Pattern apply(Pattern pattern, Node node) {

        List<Object> segments = pattern.getSegments();
        segments.add(node);
        return Pattern.newBuilder()
                .setFirst(pattern.getFirst())
                .setNextJoinID(node.getNodeID())
                .setSegments(segments)
                .setTimestamp(Math.min(node.getTimestamp(),pattern.getTimestamp()))
                .build();
    }
}

我的意圖是捕捉如下模式: (nodeId == 1)--[label == "related_to"]-->()其中

  • (nodeId == 1) 表示 id=1 的節點
  • --[label == "related_to"]--> 用 label = "related_to" 表示有向邊
  • () 表示一個通用節點。

將這些片段連接在一起的想法是使用之前的 Valuejoiners 執行兩個連續的連接。 我希望您關注兩個 ValueJoiners 執行的第一個操作:為了構建模式,我只是簡單地將 append 節點和邊放在作為模式的 Avro 模式的一部分的列表的末尾。 以下是生成節點和邊並將它們發布在相應主題中的通用循環。 每個節點記錄的key對應於nodeID,每個邊記錄的key是邊的傳入節點的nodeID。

while(true){

            try (final KafkaProducer<Long, Node> nodeKafkaProducer = new KafkaProducer<Long, Node>(props)) {

                final KafkaProducer<Long, Edge> edgeKafkaProducer = new KafkaProducer<Long, Edge>(props);

                nodeKafkaProducer.send(new ProducerRecord<Long, Node>(nodeTopic, (long) 1,
                        buildNodeRecord(1, Collections.singletonList("aString"), "aString",
                                System.currentTimeMillis())));
                edgeKafkaProducer.send(new ProducerRecord<Long, Edge>(edgesTopic, (long) 1,
                        buildEdgeRecord(1, 1, 4, "related_to", "aString",
                                System.currentTimeMillis())));
                Thread.sleep(9000);

            } catch (InterruptedException e) {
                e.printStackTrace();
            }
        }

在哪里:

private Node buildNodeRecord(long nodeId, List<String> labelsToSet, String property, long timestamp){

        Node record = new Node();
        record.setNodeID(nodeId);
        record.setLabels(labelsToSet);
        Map<String, String> propMap = new HashMap<String, String>();
        propMap.put("property", property);
        record.setProperties(propMap);
        record.setTimestamp(timestamp);
        return record;


    }  
private Edge buildEdgeRecord(long edgeId,long startID, long endID, String type, String property, long timestamp) {

        Edge record = new Edge();
        record.setEdgeID(edgeId);
        record.setStartID(startID);
        record.setEndID(endID);
        record.setType(type);

        Map<String,String> propMap = new HashMap<String, String>();
        propMap.put("property",property);
        record.setProperties(propMap);
        record.setTimestamp(timestamp);

        return record;
    }

代碼的以下部分描述了管道。

//configuration of specific avro serde for pattern type
        final SpecificAvroSerde<Pattern> patternSpecificAvroSerde = new SpecificAvroSerde<>();
        final Map<String, String> serdeConfig = Collections.singletonMap(
                AbstractKafkaSchemaSerDeConfig.SCHEMA_REGISTRY_URL_CONFIG, envProps.getProperty("schema.registry.url"));
        patternSpecificAvroSerde.configure(serdeConfig,false);

        //the valueJoiners we need
        final NodeEdgeJoiner nodeEdgeJoiner = new NodeEdgeJoiner();
        final PatternNodeJoiner patternNodeJoiner = new PatternNodeJoiner();

        //timestampExtractors
        NodeTimestampExtractor nodeTimestampExtractor = new NodeTimestampExtractor();
        SubPatternTimeStampExtractor subPatternTimeStampExtractor = new SubPatternTimeStampExtractor();
        EdgeTimestampExtractor edgeTimestampExtractor = new EdgeTimestampExtractor();

        //node source
        final KStream<Long, Node> nodeKStream = builder.stream(envProps.getProperty("node.topic.name"),
                Consumed.with(nodeTimestampExtractor));

       //filter on nodes topic
        nodeKStream.filter((key, value) -> value.getNodeID()==1).to(envProps.getProperty("firstnodes.topic.name"));
        final KStream<Long,Node> firstFilteredNodes = builder.stream(envProps.getProperty("firstnodes.topic.name"),
                Consumed.with(nodeTimestampExtractor));

        //edges keyed by incoming node
        final KStream<Long,Edge> edgeKstream = builder.stream(envProps.getProperty("edge.topic.name"),
                Consumed.with(edgeTimestampExtractor));

        //filter operation on edges for the first part of the pattern
        final KStream<Long,Edge> firstEdgeFiltered = edgeKstream.filter((key, value) ->
                value.getType().equals("related_to"));

        //first join
        firstFilteredNodes.join(firstEdgeFiltered,nodeEdgeSubJoiner,
                JoinWindows.of(Duration.ofSeconds(10)))
                .map((key, value) -> new KeyValue<Long, SubPattern>(value.getNextJoinID(), value))
                .to(envProps.getProperty("firstJoin.topic.name"));

        final KStream <Long,SubPattern> mappedFirstJoin = builder.stream(envProps.getProperty("firstJoin.topic.name"),
                Consumed.with(subPatternTimeStampExtractor));

        //second join
        KStream <Long,Pattern> secondJoin = mappedFirstJoin
                .join(nodeKStream,subPatternNodeJoiner, JoinWindows.of(Duration.ofSeconds(10)));
        secondJoin.print(Printed.toSysOut()); // should print out final records

我不會展示時間戳提取器,因為我認為它們與重點無關。

問題

所以我希望 output 是模式記錄的 stream每個模式的列表(來自 Avro 模式的“段”)的大小相同:1 個節點 1 個邊和另一個節點。 但這不會發生。 相反,我得到了這個 output:

[KSTREAM-MERGE-0000000018]: 4, {"first": 1, "nextJoinID": 4, "timestamp": 1611252427338, "segments": [{"nodeID": 1, "labels": ["aString"], "properties": {"property": "aString"}, "timestamp": 1611252427338}, {"edgeID": 1, "type": "related_to", "properties": {"property": "aString"}, "startID": 1, "endID": 4, "timestamp": 1611252427777}, {"nodeID": 4, "labels": ["aString"], "properties": {"property": "aString"}, "timestamp": 1611252427795}]}
[KSTREAM-MERGE-0000000018]: 4, {"first": 1, "nextJoinID": 4, "timestamp": 1611252427338, "segments": [{"nodeID": 1, "labels": ["aString"], "properties": {"property": "aString"}, "timestamp": 1611252427338}, {"edgeID": 1, "type": "related_to", "properties": {"property": "aString"}, "startID": 1, "endID": 4, "timestamp": 1611252427777}, {"nodeID": 4, "labels": ["aString"], "properties": {"property": "aString"}, "timestamp": 1611252427795}, {"nodeID": 4, "labels": ["aString"], "properties": {"property": "aString"}, "timestamp": 1611252436847}]}
[KSTREAM-MERGE-0000000018]: 4, {"first": 1, "nextJoinID": 4, "timestamp": 1611252427338, "segments": [{"nodeID": 1, "labels": ["aString"], "properties": {"property": "aString"}, "timestamp": 1611252427338}, {"edgeID": 1, "type": "related_to", "properties": {"property": "aString"}, "startID": 1, "endID": 4, "timestamp": 1611252436837}, {"nodeID": 4, "labels": ["aString"], "properties": {"property": "aString"}, "timestamp": 1611252427795}]}
[KSTREAM-MERGE-0000000018]: 4, {"first": 1, "nextJoinID": 4, "timestamp": 1611252427338, "segments": [{"nodeID": 1, "labels": ["aString"], "properties": {"property": "aString"}, "timestamp": 1611252427338}, {"edgeID": 1, "type": "related_to", "properties": {"property": "aString"}, "startID": 1, "endID": 4, "timestamp": 1611252436837}, {"nodeID": 4, "labels": ["aString"], "properties": {"property": "aString"}, "timestamp": 1611252427795}, {"nodeID": 4, "labels": ["aString"], "properties": {"property": "aString"}, "timestamp": 1611252436847}]}
[KSTREAM-MERGE-0000000018]: 4, {"first": 1, "nextJoinID": 4, "timestamp": 1611252427777, "segments": [{"nodeID": 1, "labels": ["aString"], "properties": {"property": "aString"}, "timestamp": 1611252436822}, {"edgeID": 1, "type": "related_to", "properties": {"property": "aString"}, "startID": 1, "endID": 4, "timestamp": 1611252427777}, {"nodeID": 4, "labels": ["aString"], "properties": {"property": "aString"}, "timestamp": 1611252427795}]}
[KSTREAM-MERGE-0000000018]: 4, {"first": 1, "nextJoinID": 4, "timestamp": 1611252427777, "segments": [{"nodeID": 1, "labels": ["aString"], "properties": {"property": "aString"}, "timestamp": 1611252436822}, {"edgeID": 1, "type": "related_to", "properties": {"property": "aString"}, "startID": 1, "endID": 4, "timestamp": 1611252427777}, {"nodeID": 4, "labels": ["aString"], "properties": {"property": "aString"}, "timestamp": 1611252427795}, {"nodeID": 4, "labels": ["aString"], "properties": {"property": "aString"}, "timestamp": 1611252436847}]}
[KSTREAM-MERGE-0000000018]: 4, {"first": 1, "nextJoinID": 4, "timestamp": 1611252427795, "segments": [{"nodeID": 1, "labels": ["aString"], "properties": {"property": "aString"}, "timestamp": 1611252436822}, {"edgeID": 1, "type": "related_to", "properties": {"property": "aString"}, "startID": 1, "endID": 4, "timestamp": 1611252436837}, {"nodeID": 4, "labels": ["aString"], "properties": {"property": "aString"}, "timestamp": 1611252427795}]}
[KSTREAM-MERGE-0000000018]: 4, {"first": 1, "nextJoinID": 4, "timestamp": 1611252436822, "segments": [{"nodeID": 1, "labels": ["aString"], "properties": {"property": "aString"}, "timestamp": 1611252436822}, {"edgeID": 1, "type": "related_to", "properties": {"property": "aString"}, "startID": 1, "endID": 4, "timestamp": 1611252436837}, {"nodeID": 4, "labels": ["aString"], "properties": {"property": "aString"}, "timestamp": 1611252427795}, {"nodeID": 4, "labels": ["aString"], "properties": {"property": "aString"}, "timestamp": 1611252436847}]}
.  
.  
.  

如您所見,每條記錄中的有序節點和邊數組的大小是不同的。 特別是我總是在它們中看到:一個節點和一條邊,后面跟着許多節點。 如果我減少 while(true){...} 中的睡眠毫秒數,它會變得更糟,並生成非常長的列表,列表中有更多節點。 我保證節點-邊連接在每種情況下都表現良好。 它總是產生正確的結果。 該問題似乎影響了第二次加入。 但我不明白如何..我試圖做一些測試但沒有成功..

以下是拓撲:

Topologies:
   Sub-topology: 0
    Source: KSTREAM-SOURCE-0000000000 (topics: [nodes])
      --> KSTREAM-WINDOWED-0000000015, KSTREAM-FILTER-0000000001
    Source: KSTREAM-SOURCE-0000000013 (topics: [firstJoin])
      --> KSTREAM-WINDOWED-0000000014
    Processor: KSTREAM-WINDOWED-0000000014 (stores: [KSTREAM-JOINTHIS-0000000016-store])
      --> KSTREAM-JOINTHIS-0000000016
      <-- KSTREAM-SOURCE-0000000013
    Processor: KSTREAM-WINDOWED-0000000015 (stores: [KSTREAM-JOINOTHER-0000000017-store])
      --> KSTREAM-JOINOTHER-0000000017
      <-- KSTREAM-SOURCE-0000000000
    Processor: KSTREAM-JOINOTHER-0000000017 (stores: [KSTREAM-JOINTHIS-0000000016-store])
      --> KSTREAM-MERGE-0000000018
      <-- KSTREAM-WINDOWED-0000000015
    Processor: KSTREAM-JOINTHIS-0000000016 (stores: [KSTREAM-JOINOTHER-0000000017-store])
      --> KSTREAM-MERGE-0000000018
      <-- KSTREAM-WINDOWED-0000000014
    Processor: KSTREAM-FILTER-0000000001 (stores: [])
      --> KSTREAM-SINK-0000000002
      <-- KSTREAM-SOURCE-0000000000
    Processor: KSTREAM-MERGE-0000000018 (stores: [])
      --> KSTREAM-PRINTER-0000000019
      <-- KSTREAM-JOINTHIS-0000000016, KSTREAM-JOINOTHER-0000000017
    Processor: KSTREAM-PRINTER-0000000019 (stores: [])
      --> none
      <-- KSTREAM-MERGE-0000000018
    Sink: KSTREAM-SINK-0000000002 (topic: firstFilter)
      <-- KSTREAM-FILTER-0000000001

  Sub-topology: 1
    Source: KSTREAM-SOURCE-0000000004 (topics: [edges])
      --> KSTREAM-FILTER-0000000005
    Processor: KSTREAM-FILTER-0000000005 (stores: [])
      --> KSTREAM-WINDOWED-0000000007
      <-- KSTREAM-SOURCE-0000000004
    Source: KSTREAM-SOURCE-0000000003 (topics: [firstFilter])
      --> KSTREAM-WINDOWED-0000000006
    Processor: KSTREAM-WINDOWED-0000000006 (stores: [KSTREAM-JOINTHIS-0000000008-store])
      --> KSTREAM-JOINTHIS-0000000008
      <-- KSTREAM-SOURCE-0000000003
    Processor: KSTREAM-WINDOWED-0000000007 (stores: [KSTREAM-JOINOTHER-0000000009-store])
      --> KSTREAM-JOINOTHER-0000000009
      <-- KSTREAM-FILTER-0000000005
    Processor: KSTREAM-JOINOTHER-0000000009 (stores: [KSTREAM-JOINTHIS-0000000008-store])
      --> KSTREAM-MERGE-0000000010
      <-- KSTREAM-WINDOWED-0000000007
    Processor: KSTREAM-JOINTHIS-0000000008 (stores: [KSTREAM-JOINOTHER-0000000009-store])
      --> KSTREAM-MERGE-0000000010
      <-- KSTREAM-WINDOWED-0000000006
    Processor: KSTREAM-MERGE-0000000010 (stores: [])
      --> KSTREAM-MAP-0000000011
      <-- KSTREAM-JOINTHIS-0000000008, KSTREAM-JOINOTHER-0000000009
    Processor: KSTREAM-MAP-0000000011 (stores: [])
      --> KSTREAM-SINK-0000000012
      <-- KSTREAM-MERGE-0000000010
    Sink: KSTREAM-SINK-0000000012 (topic: firstJoin)
      <-- KSTREAM-MAP-0000000011

pom.xml

<groupId>KafkaJOINS</groupId>
    <artifactId>KafkaJOINS</artifactId>
    <version>1.0</version>

    <repositories>
        <repository>
            <id>confluent</id>
            <url>https://packages.confluent.io/maven/</url>
        </repository>
    </repositories>

    <pluginRepositories>
        <pluginRepository>
            <id>confluent</id>
            <url>https://packages.confluent.io/maven/</url>
        </pluginRepository>
    </pluginRepositories>

    <properties>
        <log4j.version>2.13.3</log4j.version>
        <avro.version>1.9.2</avro.version>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <confluent.version>6.0.0</confluent.version>
        <kafka.version>6.0.0-ccs</kafka.version>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.apache.kafka</groupId>
            <artifactId>kafka-streams</artifactId>
            <version>${kafka.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.kafka</groupId>
            <artifactId>kafka-clients</artifactId>
            <version>${kafka.version}</version>
        </dependency><dependency>
            <groupId>io.confluent</groupId>
            <artifactId>kafka-streams-avro-serde</artifactId>
            <version>${confluent.version}</version>
        </dependency>
        <dependency>
            <groupId>io.confluent</groupId>
            <artifactId>kafka-avro-serializer</artifactId>
            <version>${confluent.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.avro</groupId>
            <artifactId>avro</artifactId>
            <version>${avro.version}</version>
        </dependency>

在您的第一個ValueJoiner ,您創建一個新的新 object:

Object[] segments = {node,edge};

在您的第二個ValueJoiner中,您將獲得一個列表並將其添加到其中。 不過,您需要深度復制列表:

// your code
List<Object> segments = pattern.getSegments();
segments.add(node); // this effectively modifies the input object;
                    // if this input object joins multiple times,
                    // you may introduce an undesired side effect

// instead you should do
List<Object> segments = new LinkedList<>(pattern.getSegments());
segments.add(node);

暫無
暫無

聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.

 
粵ICP備18138465號  © 2020-2024 STACKOOM.COM