
Spark Structured Streaming: join stream with data that should be read every micro batch

I have a stream from HDFS and I need to join it with my metadata that is also in HDFS, both Parquets. 我有一个来自HDFS的流,我需要将它与同样包含在HDFS中的元数据(两个Parquets)一起加入。

My metadata sometimes gets updated, and I need to join against the freshest, most recent version, which ideally means reading the metadata from HDFS on every stream micro batch.

I tried to test this, but unfortunately Spark reads the metadata only once and (supposedly) caches the files, even when I set spark.sql.parquet.cacheMetadata=false.

Is there a way to read it on every micro batch? ForeachWriter is not what I'm looking for.

Here are code examples:

spark.sql("SET spark.sql.streaming.schemaInference=true")

spark.sql("SET spark.sql.parquet.cacheMetadata=false")

val stream = spark.readStream.parquet("/tmp/streaming/")

val metadata = spark.read.parquet("/tmp/metadata/")

val joinedStream = stream.join(metadata, Seq("id"))

joinedStream.writeStream.option("checkpointLocation", "/tmp/streaming-test/checkpoint").format("console").start()



/tmp/metadata/ got updated with Spark append mode.
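For context, the metadata update looks roughly like this (a minimal sketch; the newMetadata rows and the (id, value) schema are hypothetical, the point is only the append-mode Parquet write):

import spark.implicits._

// Hypothetical new metadata rows; the schema is assumed for illustration.
val newMetadata = Seq((1L, "updated-value"), (2L, "new-value")).toDF("id", "value")

// Append the new rows to the same Parquet directory the stream joins against.
newMetadata.write.mode("append").parquet("/tmp/metadata/")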

As far as I understand, when the metadata is accessed through JDBC (see the question "jdbc source and spark structured streaming"), Spark will query it on each micro batch.
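A minimal sketch of that JDBC variant, assuming a PostgreSQL connection and a metadata table; the connection details and credentials are placeholders, not part of the original answer:

import java.util.Properties

val jdbcProps = new Properties()
jdbcProps.setProperty("user", "user")          // placeholder credentials
jdbcProps.setProperty("password", "password")

// Static side of a stream-static join; with a JDBC source it is queried again per micro batch.
val metadataJdbc = spark.read.jdbc("jdbc:postgresql://host:5432/db", "metadata", jdbcProps)

val joinedJdbc = stream.join(metadataJdbc, Seq("id"))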

As far as I can tell, there are two options:

  1. Create a temp view and refresh it on an interval (see the first sketch after this list):

    metadata.createOrReplaceTempView("metadata")

and trigger the refresh in a separate thread:

spark.catalog.refreshTable("metadata")

NOTE: in this case Spark will only re-read the same path; it does not work if you need to read metadata from different folders on HDFS, e.g. paths with timestamps etc.

  2. Restart the stream on an interval, as Tathagata Das suggested (see the second sketch after this list).

This way is not suitable for me, since my metadata might be refreshed several times per hour. 这种方式不适合我,因为我的元数据可能每小时刷新几次。
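A rough sketch of option 1, putting the pieces above together; the single-threaded scheduler and the 10-minute refresh interval are assumptions, not part of the original answer:

import java.util.concurrent.{Executors, TimeUnit}

// Register the static metadata once under a name the streaming query can resolve.
val metadata = spark.read.parquet("/tmp/metadata/")
metadata.createOrReplaceTempView("metadata")

// Periodically invalidate the cached file listing so newly appended Parquet files are picked up.
val scheduler = Executors.newSingleThreadScheduledExecutor()
scheduler.scheduleAtFixedRate(new Runnable {
  def run(): Unit = spark.catalog.refreshTable("metadata")
}, 10, 10, TimeUnit.MINUTES)

// Join the stream against the refreshed view instead of a fixed DataFrame.
val joinedStream = stream.join(spark.table("metadata"), Seq("id"))
joinedStream.writeStream.option("checkpointLocation", "/tmp/streaming-test/checkpoint").format("console").start()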
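And a rough sketch of option 2, restarting the query on an interval so the static side is re-read; the one-hour interval and the console sink are assumptions, and the checkpoint keeps the stream's progress across restarts:

// Loop forever: re-read the metadata, run the streaming query for a while, then restart it.
while (true) {
  val freshMetadata = spark.read.parquet("/tmp/metadata/")

  val query = stream.join(freshMetadata, Seq("id"))
    .writeStream
    .option("checkpointLocation", "/tmp/streaming-test/checkpoint")
    .format("console")
    .start()

  // Let the query run for an hour (assumed interval), then stop and loop to pick up new metadata.
  query.awaitTermination(60 * 60 * 1000L)
  query.stop()
}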
