Spark Structured Streaming: join stream with data that should be read every micro batch
I have a stream from HDFS and I need to join it with metadata that also lives in HDFS; both are Parquet. My metadata is sometimes updated, and I need to join against the freshest version, which ideally means re-reading the metadata from HDFS on every stream micro batch.
I tried to test this, but unfortunately Spark reads the metadata once and (supposedly) caches the file listing, even when I set spark.sql.parquet.cacheMetadata=false.
Is there a way to read it on every micro batch? ForeachWriter is not what I'm looking for, is it?
Here is a code example:
spark.sql("SET spark.sql.streaming.schemaInference=true")
spark.sql("SET spark.sql.parquet.cacheMetadata=false")
val stream = spark.readStream.parquet("/tmp/streaming/")
val metadata = spark.read.parquet("/tmp/metadata/")
val joinedStream = stream.join(metadata, Seq("id"))
joinedStream.writeStream.option("checkpointLocation", "/tmp/streaming-test/checkpoint").format("console").start()
/tmp/metadata/ gets updated with Spark's append mode.
As far as I understand, when metadata is accessed through a JDBC source (see "jdbc source and spark structured streaming"), Spark will query it on each micro batch.
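For illustration, a stream-static join against a JDBC-backed metadata table might look like the sketch below. The connection URL, table name, and credentials are placeholders, not taken from the question; the point is that the JDBC relation is re-planned per micro batch, so fresh rows are picked up:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("jdbc-metadata-join").getOrCreate()

// Infer the schema of the Parquet stream, as in the original example.
spark.sql("SET spark.sql.streaming.schemaInference=true")

val stream = spark.readStream.parquet("/tmp/streaming/")

// Metadata read through JDBC. All connection options below are
// hypothetical placeholders; substitute your own database details.
val metadata = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://db-host:5432/meta")
  .option("dbtable", "metadata")
  .option("user", "user")
  .option("password", "password")
  .load()

// Stream-static join: the static side is re-queried when each
// micro batch is planned, so updates to the table become visible.
val joined = stream.join(metadata, Seq("id"))

joined.writeStream
  .option("checkpointLocation", "/tmp/streaming-test/checkpoint")
  .format("console")
  .start()
```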
As far as I found, there are two options:
Create a temp view and refresh it on an interval:
metadata.createOrReplaceTempView("metadata")
and trigger a refresh in a separate thread:
spark.catalog.refreshTable("metadata")
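Putting the two pieces together, a minimal sketch of this approach could look as follows. The 5-minute refresh interval and the scheduler setup are illustrative choices, not part of the original answer:

```scala
import java.util.concurrent.{Executors, TimeUnit}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("refresh-metadata").getOrCreate()
spark.sql("SET spark.sql.streaming.schemaInference=true")

// Register the batch metadata as a temp view that the stream joins against.
val metadata = spark.read.parquet("/tmp/metadata/")
metadata.createOrReplaceTempView("metadata")

val stream = spark.readStream.parquet("/tmp/streaming/")
stream.createOrReplaceTempView("stream")

val joined = spark.sql("SELECT * FROM stream JOIN metadata USING (id)")

// Refresh the cached file listing for the metadata view from a separate
// thread, so files newly appended under /tmp/metadata/ become visible.
// The interval here is arbitrary; tune it to your update frequency.
val scheduler = Executors.newSingleThreadScheduledExecutor()
scheduler.scheduleAtFixedRate(new Runnable {
  def run(): Unit = spark.catalog.refreshTable("metadata")
}, 5, 5, TimeUnit.MINUTES)

joined.writeStream
  .option("checkpointLocation", "/tmp/streaming-test/checkpoint")
  .format("console")
  .start()
```

Note that, as stated below, this only re-reads the same path; it does not help if the fresh metadata lands in a differently named folder.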
NOTE: in this case Spark will only re-read the same path; it does not work if you need to read metadata from different folders on HDFS, e.g. paths with timestamps.
This approach is not suitable for me, since my metadata might be refreshed several times per hour.