
Left outer join not emitting null values when joining two streams in Spark Structured Streaming 2.3.0

A left outer join on two streams is not emitting the null outputs. It just waits for a matching record to be added to the other stream. We are using socket streams to test this. In our case, we want to emit records with null values that do not match on id and/or do not fall within the time-range condition.

Details of the watermarks and intervals are:

val ds1Map = ds1
  .selectExpr("Id AS ds1_Id", "ds1_timestamp")
  .withWatermark("ds1_timestamp", "10 seconds")

val ds2Map = ds2
  .selectExpr("Id AS ds2_Id", "ds2_timestamp")
  .withWatermark("ds2_timestamp", "20 seconds")

val output = ds1Map.join(
  ds2Map,
  expr("""ds1_Id = ds2_Id AND
          ds2_timestamp >= ds1_timestamp AND
          ds2_timestamp <= ds1_timestamp + interval 1 minutes"""),
  "leftOuter")

val query = output.select("*")
  .writeStream
  .outputMode(OutputMode.Append)
  .format("console")
  .option("checkpointLocation", "./spark-checkpoints/")
  .start()

query.awaitTermination()

Thank you.

This may be due to one of the caveats of the micro-batch architecture implementation, as noted in the developer's guide here: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#semantic-guarantees-of-stream-stream-inner-joins-with-watermarking

In the current implementation in the micro-batch engine, watermarks are advanced at the end of a micro-batch, and the next micro-batch uses the updated watermark to clean up state and output outer results. Since we trigger a micro-batch only when there is new data to be processed, the generation of the outer result may get delayed if there is no new data being received in the stream. In short, if any of the two input streams being joined does not receive data for a while, the outer (both cases, left or right) output may get delayed.
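The mechanics described in that quote can be sketched with a minimal, self-contained simulation. This is plain Scala, not Spark: the batch triggering, state layout, and watermark bookkeeping are all simplified assumptions, kept only to show why the null row surfaces a batch late.

```scala
// Minimal simulation of the micro-batch watermark mechanics described above
// (plain Scala, no Spark; batch triggering and state layout are simplified).
object WatermarkSim {
  case class Row(id: Int, time: Long)        // event time in seconds

  val delay  = 10L                           // watermark delay on the stream
  val window = 60L                           // join window: right in [left, left + 60s]

  var watermark = 0L                         // advanced only at the END of a batch
  var leftState = List.empty[Row]            // buffered unmatched left rows
  var output    = List.empty[(Row, Option[Row])]

  def runBatch(left: Seq[Row], right: Seq[Row]): Unit = {
    // 1. Join new left rows (plus buffered state) against this batch's right rows.
    val all = leftState ++ left
    val (matched, unmatched) = all.partition { l =>
      right.exists(r => r.id == l.id && r.time >= l.time && r.time <= l.time + window)
    }
    matched.foreach { l =>
      output :+= ((l, right.find(r =>
        r.id == l.id && r.time >= l.time && r.time <= l.time + window)))
    }

    // 2. Unmatched rows whose join window fell below the watermark set by the
    //    PREVIOUS batch are only now emitted with a null right side.
    val (expired, kept) = unmatched.partition(l => l.time + window < watermark)
    expired.foreach(l => output :+= ((l, None)))
    leftState = kept

    // 3. The watermark advances at the end of the batch, so its effect is
    //    visible only in the NEXT batch -- and a next batch needs new data.
    val maxSeen = (left ++ right).map(_.time).foldLeft(watermark + delay)(math.max)
    watermark = maxSeen - delay
  }
}
```

Running three batches shows the effect: a left row arrives in batch 1, late right-side data in batch 2 pushes the watermark past the join window, yet the `(row, null)` output only appears in batch 3 — and in real Spark that third batch fires only when new data arrives on either stream, which is exactly the delay observed in the question.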

This was the case for me: the null data was not getting flushed out until a further batch was triggered sometime later.

Hi Jack, and thanks for the response. The question/issue was a year and a half ago, and it took some time to recover what I did last year :). I ran a stream-to-stream join on two topics, one with more than 10K msg/sec, and it was running on a Spark cluster with 4.67 TB total memory and 1614 VCores total.

The implementation was a simple Structured Streaming stream-to-stream join, as in the official Spark documentation:

// Join with event-time constraints
impressionsWithWatermark.join(
  clicksWithWatermark,
  expr("""
    clickAdId = impressionAdId AND
    clickTime >= impressionTime AND
    clickTime <= impressionTime + interval 1 hour
    """)
)

It was running for a few hours until OOM. After investigation, I found the issue about Spark's state cleanup in HDFSBackedStateStoreProvider and the open Jira in Spark:

https://issues.apache.org/jira/browse/SPARK-23682

Memory issue with spark structured streaming

And this is why I moved back and implemented the stream-to-stream join with mapWithState in Spark Streaming 2.1.1.

Thanks.
