
Spark Structured Streaming: join stream with data that should be read every micro batch

I have a stream from HDFS and I need to join it with my metadata that is also in HDFS, both Parquets. 我有一个来自HDFS的流,我需要将它与同样包含在HDFS中的元数据(两个Parquets)一起加入。

My metadata sometimes gets updated, and I need to join against the freshest, most recent version, which ideally means reading the metadata from HDFS on every stream micro batch.

I tried to test this, but unfortunately Spark reads the metadata only once and (supposedly) caches the files, even when I set spark.sql.parquet.cacheMetadata=false.

Is there a way to read it on every micro batch? ForeachWriter is not what I'm looking for.

Here are code examples:

spark.sql("SET spark.sql.streaming.schemaInference=true")

spark.sql("SET spark.sql.parquet.cacheMetadata=false")

val stream = spark.readStream.parquet("/tmp/streaming/")

val metadata = spark.read.parquet("/tmp/metadata/")

val joinedStream = stream.join(metadata, Seq("id"))

joinedStream.writeStream.option("checkpointLocation", "/tmp/streaming-test/checkpoint").format("console").start()



/tmp/metadata/ got updated with Spark append mode.
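For context, the metadata update looks roughly like this (a minimal sketch; the newMetadata rows and the (id, value) schema are hypothetical, the point is only the append-mode Parquet write):

import spark.implicits._

// Hypothetical new metadata rows; the schema is assumed for illustration.
val newMetadata = Seq((1L, "updated-value"), (2L, "new-value")).toDF("id", "value")

// Append the new rows to the same Parquet directory the stream joins against.
newMetadata.write.mode("append").parquet("/tmp/metadata/")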

As far as I understand, when the metadata is accessed through JDBC (see the question "jdbc source and spark structured streaming"), Spark will query it on each micro batch.
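A minimal sketch of that JDBC variant, assuming a PostgreSQL connection and a metadata table; the connection details and credentials are placeholders, not part of the original answer:

import java.util.Properties

val jdbcProps = new Properties()
jdbcProps.setProperty("user", "user")          // placeholder credentials
jdbcProps.setProperty("password", "password")

// Static side of a stream-static join; with a JDBC source it is queried again per micro batch.
val metadataJdbc = spark.read.jdbc("jdbc:postgresql://host:5432/db", "metadata", jdbcProps)

val joinedJdbc = stream.join(metadataJdbc, Seq("id"))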

As far as I can tell, there are two options:

  1. Create a temp view and refresh it on an interval (see the first sketch after this list):

    metadata.createOrReplaceTempView("metadata")

and trigger the refresh in a separate thread:

spark.catalog.refreshTable("metadata")

NOTE: in this case Spark will only re-read the same path; it does not work if you need to read metadata from different folders on HDFS, e.g. paths with timestamps etc.

  2. Restart the stream on an interval, as Tathagata Das suggested (see the second sketch after this list).

This way is not suitable for me, since my metadata might be refreshed several times per hour. 这种方式不适合我,因为我的元数据可能每小时刷新几次。
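A rough sketch of option 1, putting the pieces above together; the single-threaded scheduler and the 10-minute refresh interval are assumptions, not part of the original answer:

import java.util.concurrent.{Executors, TimeUnit}

// Register the static metadata once under a name the streaming query can resolve.
val metadata = spark.read.parquet("/tmp/metadata/")
metadata.createOrReplaceTempView("metadata")

// Periodically invalidate the cached file listing so newly appended Parquet files are picked up.
val scheduler = Executors.newSingleThreadScheduledExecutor()
scheduler.scheduleAtFixedRate(new Runnable {
  def run(): Unit = spark.catalog.refreshTable("metadata")
}, 10, 10, TimeUnit.MINUTES)

// Join the stream against the refreshed view instead of a fixed DataFrame.
val joinedStream = stream.join(spark.table("metadata"), Seq("id"))
joinedStream.writeStream.option("checkpointLocation", "/tmp/streaming-test/checkpoint").format("console").start()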
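And a rough sketch of option 2, restarting the query on an interval so the static side is re-read; the one-hour interval and the console sink are assumptions, and the checkpoint keeps the stream's progress across restarts:

// Loop forever: re-read the metadata, run the streaming query for a while, then restart it.
while (true) {
  val freshMetadata = spark.read.parquet("/tmp/metadata/")

  val query = stream.join(freshMetadata, Seq("id"))
    .writeStream
    .option("checkpointLocation", "/tmp/streaming-test/checkpoint")
    .format("console")
    .start()

  // Let the query run for an hour (assumed interval), then stop and loop to pick up new metadata.
  query.awaitTermination(60 * 60 * 1000L)
  query.stop()
}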
