2 consecutive stream-stream inner joins produce wrong results: what does KStream join between streams really do internally?
I have a stream of nodes and a stream of edges that represent consecutive updates of a graph, and I want to build patterns composed of nodes and edges using multiple joins in series. Let's say I want to match a pattern like: (node1)--[edge1]-->(node2).

My idea is to join the stream of nodes with the stream of edges to compose a stream of sub-patterns of the form (node1)--[edge1]-->, and then join the resulting stream with the stream of nodes again to compose the final pattern (node1)--[edge1]-->(node2). Filtering for specific types of nodes and edges is not important here.

So I have nodes, edges, and patterns structured in Avro format:
{
    "namespace": "DataModel",
    "type": "record",
    "name": "Node",
    "doc": "Node schema, it contains a nodeID label and properties",
    "fields": [
        {
            "name": "nodeID",
            "type": "long"
        },
        {
            "name": "labels",
            "type": {
                "type": "array",
                "items": "string",
                "avro.java.string": "String"
            }
        },
        {
            "name": "properties",
            "type": {
                "type": "map",
                "values": "string",
                "avro.java.string": "String"
            }
        },
        {
            "name": "timestamp",
            "type": "long"
        }
    ]
}
{
    "namespace": "DataModel",
    "type": "record",
    "name": "Edge",
    "doc": "contains edgeID, a type, a list of properties, a starting node ID and an ending node ID",
    "fields": [
        {
            "name": "edgeID",
            "type": "long"
        },
        {
            "name": "type",
            "type": "string"
        },
        {
            "name": "properties",
            "type": {
                "type": "map",
                "values": "string",
                "avro.java.string": "String"
            }
        },
        {
            "name": "startID",
            "type": "long"
        },
        {
            "name": "endID",
            "type": "long"
        },
        {
            "name": "timestamp",
            "type": "long"
        }
    ]
}
{
    "namespace": "DataModel",
    "type": "record",
    "name": "Pattern",
    "fields": [
        {
            "name": "first",
            "type": "long"
        },
        {
            "name": "nextJoinID",
            "type": [
                "null",
                "long"
            ],
            "default": null
        },
        {
            "name": "timestamp",
            "type": "long"
        },
        {
            "name": "segments",
            "doc": "It's the ordered list of nodes and edges that compose this sub-pattern from the leftmost node to the rightmost edge or node",
            "type": {
                "type": "array",
                "items": [
                    "DataModel.Node",
                    "DataModel.Edge"
                ]
            }
        }
    ]
}
Then I have the following two ValueJoiners. The first is used for the inner join between the node stream and the edge stream; the second is used for the inner join between the resulting sub-pattern stream and the node stream.
public class NodeEdgeJoiner implements ValueJoiner<Node, Edge, Pattern> {

    @Override
    public Pattern apply(Node node, Edge edge) {
        Object[] segments = {node, edge};
        return Pattern.newBuilder()
                .setFirst(node.getNodeID())
                .setNextJoinID(edge.getEndID())
                .setSegments(Arrays.asList(segments))
                .setTimestamp(Math.min(node.getTimestamp(), edge.getTimestamp()))
                .build();
    }
}
public class PatternNodeJoiner implements ValueJoiner<Pattern, Node, Pattern> {

    @Override
    public Pattern apply(Pattern pattern, Node node) {
        List<Object> segments = pattern.getSegments();
        segments.add(node);
        return Pattern.newBuilder()
                .setFirst(pattern.getFirst())
                .setNextJoinID(node.getNodeID())
                .setSegments(segments)
                .setTimestamp(Math.min(node.getTimestamp(), pattern.getTimestamp()))
                .build();
    }
}
My intention is to capture patterns like the following: (nodeId == 1)--[label == "related_to"]-->(), where the empty parentheses match any node.

The idea for joining the pieces together is to perform two consecutive joins using the ValueJoiners above. Please focus on the first operation performed by both ValueJoiners: in order to build a pattern, I simply append nodes and edges at the end of the list that is part of the Pattern Avro schema. Below is the generic loop that produces nodes and edges and publishes them to the corresponding topics. The key of each node record corresponds to the nodeID, and the key of each edge record is the nodeID of the edge's incoming node.
while (true) {
    try (final KafkaProducer<Long, Node> nodeKafkaProducer = new KafkaProducer<Long, Node>(props)) {
        final KafkaProducer<Long, Edge> edgeKafkaProducer = new KafkaProducer<Long, Edge>(props);
        nodeKafkaProducer.send(new ProducerRecord<Long, Node>(nodeTopic, (long) 1,
                buildNodeRecord(1, Collections.singletonList("aString"), "aString",
                        System.currentTimeMillis())));
        edgeKafkaProducer.send(new ProducerRecord<Long, Edge>(edgesTopic, (long) 1,
                buildEdgeRecord(1, 1, 4, "related_to", "aString",
                        System.currentTimeMillis())));
        Thread.sleep(9000);
    } catch (InterruptedException e) {
        e.printStackTrace();
    }
}
where:
private Node buildNodeRecord(long nodeId, List<String> labelsToSet, String property, long timestamp) {
    Node record = new Node();
    record.setNodeID(nodeId);
    record.setLabels(labelsToSet);
    Map<String, String> propMap = new HashMap<String, String>();
    propMap.put("property", property);
    record.setProperties(propMap);
    record.setTimestamp(timestamp);
    return record;
}

private Edge buildEdgeRecord(long edgeId, long startID, long endID, String type, String property, long timestamp) {
    Edge record = new Edge();
    record.setEdgeID(edgeId);
    record.setStartID(startID);
    record.setEndID(endID);
    record.setType(type);
    Map<String, String> propMap = new HashMap<String, String>();
    propMap.put("property", property);
    record.setProperties(propMap);
    record.setTimestamp(timestamp);
    return record;
}
The following part of the code describes the pipeline.
// configuration of the specific Avro serde for the Pattern type
final SpecificAvroSerde<Pattern> patternSpecificAvroSerde = new SpecificAvroSerde<>();
final Map<String, String> serdeConfig = Collections.singletonMap(
        AbstractKafkaSchemaSerDeConfig.SCHEMA_REGISTRY_URL_CONFIG, envProps.getProperty("schema.registry.url"));
patternSpecificAvroSerde.configure(serdeConfig, false);

// the ValueJoiners we need
final NodeEdgeJoiner nodeEdgeJoiner = new NodeEdgeJoiner();
final PatternNodeJoiner patternNodeJoiner = new PatternNodeJoiner();

// timestamp extractors
NodeTimestampExtractor nodeTimestampExtractor = new NodeTimestampExtractor();
SubPatternTimeStampExtractor subPatternTimeStampExtractor = new SubPatternTimeStampExtractor();
EdgeTimestampExtractor edgeTimestampExtractor = new EdgeTimestampExtractor();

// node source
final KStream<Long, Node> nodeKStream = builder.stream(envProps.getProperty("node.topic.name"),
        Consumed.with(nodeTimestampExtractor));

// filter operation on the nodes topic
nodeKStream.filter((key, value) -> value.getNodeID() == 1).to(envProps.getProperty("firstnodes.topic.name"));
final KStream<Long, Node> firstFilteredNodes = builder.stream(envProps.getProperty("firstnodes.topic.name"),
        Consumed.with(nodeTimestampExtractor));

// edges keyed by incoming node
final KStream<Long, Edge> edgeKstream = builder.stream(envProps.getProperty("edge.topic.name"),
        Consumed.with(edgeTimestampExtractor));

// filter operation on edges for the first part of the pattern
final KStream<Long, Edge> firstEdgeFiltered = edgeKstream.filter((key, value) ->
        value.getType().equals("related_to"));

// first join
firstFilteredNodes.join(firstEdgeFiltered, nodeEdgeJoiner,
        JoinWindows.of(Duration.ofSeconds(10)))
        .map((key, value) -> new KeyValue<Long, Pattern>(value.getNextJoinID(), value))
        .to(envProps.getProperty("firstJoin.topic.name"));
final KStream<Long, Pattern> mappedFirstJoin = builder.stream(envProps.getProperty("firstJoin.topic.name"),
        Consumed.with(subPatternTimeStampExtractor));

// second join
KStream<Long, Pattern> secondJoin = mappedFirstJoin
        .join(nodeKStream, patternNodeJoiner, JoinWindows.of(Duration.ofSeconds(10)));
secondJoin.print(Printed.toSysOut()); // should print out the final records
I won't show the timestamp extractors, because I don't think they are relevant to the point.
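(For reference, a minimal sketch of what such an extractor might look like, assuming each one simply returns the timestamp field embedded in its record type; the class below is an illustrative reconstruction, not the original code:)

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.streams.processor.TimestampExtractor;

import DataModel.Node;

// Hypothetical sketch of NodeTimestampExtractor; EdgeTimestampExtractor and
// SubPatternTimeStampExtractor would be analogous.
public class NodeTimestampExtractor implements TimestampExtractor {

    @Override
    public long extract(ConsumerRecord<Object, Object> record, long partitionTime) {
        if (record.value() instanceof Node) {
            // use the event time embedded in the Node record itself
            return ((Node) record.value()).getTimestamp();
        }
        // fall back to the partition time for unexpected value types
        return partitionTime;
    }
}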
So I expected the output to be a stream of Pattern records where, for every pattern, the list ("segments" from the Avro schema) has the same size: 1 node, 1 edge, and another node. But that's not what happens. Instead I get this output:
[KSTREAM-MERGE-0000000018]: 4, {"first": 1, "nextJoinID": 4, "timestamp": 1611252427338, "segments": [{"nodeID": 1, "labels": ["aString"], "properties": {"property": "aString"}, "timestamp": 1611252427338}, {"edgeID": 1, "type": "related_to", "properties": {"property": "aString"}, "startID": 1, "endID": 4, "timestamp": 1611252427777}, {"nodeID": 4, "labels": ["aString"], "properties": {"property": "aString"}, "timestamp": 1611252427795}]}
[KSTREAM-MERGE-0000000018]: 4, {"first": 1, "nextJoinID": 4, "timestamp": 1611252427338, "segments": [{"nodeID": 1, "labels": ["aString"], "properties": {"property": "aString"}, "timestamp": 1611252427338}, {"edgeID": 1, "type": "related_to", "properties": {"property": "aString"}, "startID": 1, "endID": 4, "timestamp": 1611252427777}, {"nodeID": 4, "labels": ["aString"], "properties": {"property": "aString"}, "timestamp": 1611252427795}, {"nodeID": 4, "labels": ["aString"], "properties": {"property": "aString"}, "timestamp": 1611252436847}]}
[KSTREAM-MERGE-0000000018]: 4, {"first": 1, "nextJoinID": 4, "timestamp": 1611252427338, "segments": [{"nodeID": 1, "labels": ["aString"], "properties": {"property": "aString"}, "timestamp": 1611252427338}, {"edgeID": 1, "type": "related_to", "properties": {"property": "aString"}, "startID": 1, "endID": 4, "timestamp": 1611252436837}, {"nodeID": 4, "labels": ["aString"], "properties": {"property": "aString"}, "timestamp": 1611252427795}]}
[KSTREAM-MERGE-0000000018]: 4, {"first": 1, "nextJoinID": 4, "timestamp": 1611252427338, "segments": [{"nodeID": 1, "labels": ["aString"], "properties": {"property": "aString"}, "timestamp": 1611252427338}, {"edgeID": 1, "type": "related_to", "properties": {"property": "aString"}, "startID": 1, "endID": 4, "timestamp": 1611252436837}, {"nodeID": 4, "labels": ["aString"], "properties": {"property": "aString"}, "timestamp": 1611252427795}, {"nodeID": 4, "labels": ["aString"], "properties": {"property": "aString"}, "timestamp": 1611252436847}]}
[KSTREAM-MERGE-0000000018]: 4, {"first": 1, "nextJoinID": 4, "timestamp": 1611252427777, "segments": [{"nodeID": 1, "labels": ["aString"], "properties": {"property": "aString"}, "timestamp": 1611252436822}, {"edgeID": 1, "type": "related_to", "properties": {"property": "aString"}, "startID": 1, "endID": 4, "timestamp": 1611252427777}, {"nodeID": 4, "labels": ["aString"], "properties": {"property": "aString"}, "timestamp": 1611252427795}]}
[KSTREAM-MERGE-0000000018]: 4, {"first": 1, "nextJoinID": 4, "timestamp": 1611252427777, "segments": [{"nodeID": 1, "labels": ["aString"], "properties": {"property": "aString"}, "timestamp": 1611252436822}, {"edgeID": 1, "type": "related_to", "properties": {"property": "aString"}, "startID": 1, "endID": 4, "timestamp": 1611252427777}, {"nodeID": 4, "labels": ["aString"], "properties": {"property": "aString"}, "timestamp": 1611252427795}, {"nodeID": 4, "labels": ["aString"], "properties": {"property": "aString"}, "timestamp": 1611252436847}]}
[KSTREAM-MERGE-0000000018]: 4, {"first": 1, "nextJoinID": 4, "timestamp": 1611252427795, "segments": [{"nodeID": 1, "labels": ["aString"], "properties": {"property": "aString"}, "timestamp": 1611252436822}, {"edgeID": 1, "type": "related_to", "properties": {"property": "aString"}, "startID": 1, "endID": 4, "timestamp": 1611252436837}, {"nodeID": 4, "labels": ["aString"], "properties": {"property": "aString"}, "timestamp": 1611252427795}]}
[KSTREAM-MERGE-0000000018]: 4, {"first": 1, "nextJoinID": 4, "timestamp": 1611252436822, "segments": [{"nodeID": 1, "labels": ["aString"], "properties": {"property": "aString"}, "timestamp": 1611252436822}, {"edgeID": 1, "type": "related_to", "properties": {"property": "aString"}, "startID": 1, "endID": 4, "timestamp": 1611252436837}, {"nodeID": 4, "labels": ["aString"], "properties": {"property": "aString"}, "timestamp": 1611252427795}, {"nodeID": 4, "labels": ["aString"], "properties": {"property": "aString"}, "timestamp": 1611252436847}]}
...
As you can see, the size of the ordered array of nodes and edges differs from record to record. In particular, I always see one node and one edge followed by many nodes. If I decrease the sleep milliseconds in the while(true){...} loop it gets worse, generating very long lists with many more nodes in them. I guarantee that the node-edge join behaves well in every case: it always produces correct results. The problem seems to affect the second join, but I don't understand how. I tried to run some tests, without success.
Here is the topology:
Topologies:
   Sub-topology: 0
    Source: KSTREAM-SOURCE-0000000000 (topics: [nodes])
      --> KSTREAM-WINDOWED-0000000015, KSTREAM-FILTER-0000000001
    Source: KSTREAM-SOURCE-0000000013 (topics: [firstJoin])
      --> KSTREAM-WINDOWED-0000000014
    Processor: KSTREAM-WINDOWED-0000000014 (stores: [KSTREAM-JOINTHIS-0000000016-store])
      --> KSTREAM-JOINTHIS-0000000016
      <-- KSTREAM-SOURCE-0000000013
    Processor: KSTREAM-WINDOWED-0000000015 (stores: [KSTREAM-JOINOTHER-0000000017-store])
      --> KSTREAM-JOINOTHER-0000000017
      <-- KSTREAM-SOURCE-0000000000
    Processor: KSTREAM-JOINOTHER-0000000017 (stores: [KSTREAM-JOINTHIS-0000000016-store])
      --> KSTREAM-MERGE-0000000018
      <-- KSTREAM-WINDOWED-0000000015
    Processor: KSTREAM-JOINTHIS-0000000016 (stores: [KSTREAM-JOINOTHER-0000000017-store])
      --> KSTREAM-MERGE-0000000018
      <-- KSTREAM-WINDOWED-0000000014
    Processor: KSTREAM-FILTER-0000000001 (stores: [])
      --> KSTREAM-SINK-0000000002
      <-- KSTREAM-SOURCE-0000000000
    Processor: KSTREAM-MERGE-0000000018 (stores: [])
      --> KSTREAM-PRINTER-0000000019
      <-- KSTREAM-JOINTHIS-0000000016, KSTREAM-JOINOTHER-0000000017
    Processor: KSTREAM-PRINTER-0000000019 (stores: [])
      --> none
      <-- KSTREAM-MERGE-0000000018
    Sink: KSTREAM-SINK-0000000002 (topic: firstFilter)
      <-- KSTREAM-FILTER-0000000001

   Sub-topology: 1
    Source: KSTREAM-SOURCE-0000000004 (topics: [edges])
      --> KSTREAM-FILTER-0000000005
    Processor: KSTREAM-FILTER-0000000005 (stores: [])
      --> KSTREAM-WINDOWED-0000000007
      <-- KSTREAM-SOURCE-0000000004
    Source: KSTREAM-SOURCE-0000000003 (topics: [firstFilter])
      --> KSTREAM-WINDOWED-0000000006
    Processor: KSTREAM-WINDOWED-0000000006 (stores: [KSTREAM-JOINTHIS-0000000008-store])
      --> KSTREAM-JOINTHIS-0000000008
      <-- KSTREAM-SOURCE-0000000003
    Processor: KSTREAM-WINDOWED-0000000007 (stores: [KSTREAM-JOINOTHER-0000000009-store])
      --> KSTREAM-JOINOTHER-0000000009
      <-- KSTREAM-FILTER-0000000005
    Processor: KSTREAM-JOINOTHER-0000000009 (stores: [KSTREAM-JOINTHIS-0000000008-store])
      --> KSTREAM-MERGE-0000000010
      <-- KSTREAM-WINDOWED-0000000007
    Processor: KSTREAM-JOINTHIS-0000000008 (stores: [KSTREAM-JOINOTHER-0000000009-store])
      --> KSTREAM-MERGE-0000000010
      <-- KSTREAM-WINDOWED-0000000006
    Processor: KSTREAM-MERGE-0000000010 (stores: [])
      --> KSTREAM-MAP-0000000011
      <-- KSTREAM-JOINTHIS-0000000008, KSTREAM-JOINOTHER-0000000009
    Processor: KSTREAM-MAP-0000000011 (stores: [])
      --> KSTREAM-SINK-0000000012
      <-- KSTREAM-MERGE-0000000010
    Sink: KSTREAM-SINK-0000000012 (topic: firstJoin)
      <-- KSTREAM-MAP-0000000011
pom.xml
<groupId>KafkaJOINS</groupId>
<artifactId>KafkaJOINS</artifactId>
<version>1.0</version>

<repositories>
    <repository>
        <id>confluent</id>
        <url>https://packages.confluent.io/maven/</url>
    </repository>
</repositories>
<pluginRepositories>
    <pluginRepository>
        <id>confluent</id>
        <url>https://packages.confluent.io/maven/</url>
    </pluginRepository>
</pluginRepositories>

<properties>
    <log4j.version>2.13.3</log4j.version>
    <avro.version>1.9.2</avro.version>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <confluent.version>6.0.0</confluent.version>
    <kafka.version>6.0.0-ccs</kafka.version>
</properties>

<dependencies>
    <dependency>
        <groupId>org.apache.kafka</groupId>
        <artifactId>kafka-streams</artifactId>
        <version>${kafka.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.kafka</groupId>
        <artifactId>kafka-clients</artifactId>
        <version>${kafka.version}</version>
    </dependency>
    <dependency>
        <groupId>io.confluent</groupId>
        <artifactId>kafka-streams-avro-serde</artifactId>
        <version>${confluent.version}</version>
    </dependency>
    <dependency>
        <groupId>io.confluent</groupId>
        <artifactId>kafka-avro-serializer</artifactId>
        <version>${confluent.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.avro</groupId>
        <artifactId>avro</artifactId>
        <version>${avro.version}</version>
    </dependency>
</dependencies>
In your first ValueJoiner you create a brand-new object:

Object[] segments = {node, edge};

In your second ValueJoiner, however, you take the input Pattern's list and add to it. You need to deep-copy the list instead:
// your code
List<Object> segments = pattern.getSegments();
segments.add(node); // this effectively modifies the input object;
                    // if this input object joins multiple times,
                    // you may introduce an undesired side effect

// instead you should do
List<Object> segments = new LinkedList<>(pattern.getSegments());
segments.add(node);
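Applied to the joiner from the question, the corrected version would look something like this (same logic as before; only the copy of the list changes):

import java.util.LinkedList;
import java.util.List;

import org.apache.kafka.streams.kstream.ValueJoiner;

public class PatternNodeJoiner implements ValueJoiner<Pattern, Node, Pattern> {

    @Override
    public Pattern apply(Pattern pattern, Node node) {
        // copy the segments so the input Pattern is never mutated; the same
        // Pattern instance may take part in several joins within the window
        List<Object> segments = new LinkedList<>(pattern.getSegments());
        segments.add(node);
        return Pattern.newBuilder()
                .setFirst(pattern.getFirst())
                .setNextJoinID(node.getNodeID())
                .setSegments(segments)
                .setTimestamp(Math.min(node.getTimestamp(), pattern.getTimestamp()))
                .build();
    }
}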