
Reading fileStream with Spark Streaming

I have a directory on HDFS where a file is copied every 10 minutes (the existing one is overwritten). I'd like to read the content of this file with Spark Streaming (1.6.0) and use it as reference data to join it to another stream.

I set the "remember window" spark.streaming.fileStream.minRememberDuration to "600s" and set newFilesOnly to false, because when I start the application I want to fetch the initial data that is already on HDFS.
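This is roughly how I apply that setting (a minimal sketch; the key is the undocumented setting quoted above, and sparkConf is the same SparkConf I pass to the StreamingContext below):

// sketch: set the remember duration before building the StreamingContext
sparkConf.set("spark.streaming.fileStream.minRememberDuration", "600s")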

import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.DStream

val ssc = new StreamingContext(sparkConf, Seconds(2))
def defaultFilter(path: Path): Boolean = !path.getName().startsWith(".")
val lines: DStream[String] =
  ssc.fileStream[LongWritable, Text, TextInputFormat](loc, defaultFilter(_), false).map(_._2.toString)
lines.foreachRDD { x => x.foreach(println) }

My idea is to persist the content of this DStream in memory and delegate the task of maintaining this "batch lookup cache" to Spark. After each change in the HDFS directory I expect to automatically have fresh data that I can join to the other stream.

What I don't understand:

  • when I start the application the data is loaded, but if I then touch the file locally and overwrite the one on HDFS, I no longer see its content printed out
  • how do I cache and reload this data?
  • when I cache it, will it be available on the worker nodes, or will this (along with the join) happen in the driver?

Should I also set the StreamingContext time interval to 10 minutes, as I will only have changes every 10 minutes?

Just a few raw ideas.

when I start the application the data is loaded but then if I touch the file locally and overwrite the one on HDFS I won't see its content printed out anymore

For Spark Streaming to pick up the data, the files have to be created atomically, e.g. by moving the file into the directory Spark is monitoring. The file rename operation is typically atomic. Can you please test this to verify it is working?
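For example, something like this (a sketch; the staging path is illustrative and loc is your monitored directory from the question):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Write the new reference file to a staging path first, then rename it into
// the monitored directory. On HDFS rename is atomic, so fileStream only ever
// sees a complete file.
val fs = FileSystem.get(new Configuration())
fs.rename(new Path("/staging/reference.csv"), new Path(loc + "/reference.csv"))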

how to cache and reload this data? When I cache it will this be available on the worker nodes or this (along with the join) will happen in the driver?

The straightforward solution might be to register a temp table in the foreachRDD() method. When new data arrives during streaming, the table can simply be recreated. Keep in mind that the logic inside the foreachRDD() method should be idempotent.
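A sketch of that idea on Spark 1.6 (the line parsing and the table name are illustrative assumptions about your file format):

import org.apache.spark.sql.SQLContext

lines.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
    import sqlContext.implicits._
    // assume each line is "key,value"; adjust the parsing to your format
    val refDF = rdd.map(_.split(",", 2)).map(a => (a(0), a(1))).toDF("key", "value")
    // re-registering under the same name is idempotent: the latest batch wins
    refDF.registerTempTable("reference_lookup")
  }
}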

Knowing the table name, you can easily create a separate query pipeline that joins data from this precached temp table. Just make sure you set the StreamingContext to remember a sufficient amount of streaming data so that the query can run.
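For instance (a sketch; eventsStream, its (key, payload) shape and the join columns are placeholders for your other stream):

import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.Minutes

// keep generated RDDs around long enough for queries across batches
ssc.remember(Minutes(10))

eventsStream.foreachRDD { rdd =>
  val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
  import sqlContext.implicits._
  rdd.toDF("key", "payload").registerTempTable("events")
  // join the other stream against the pre-registered lookup table
  sqlContext.sql(
    "SELECT e.key, e.payload, r.value " +
    "FROM events e JOIN reference_lookup r ON e.key = r.key").show()
}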

Should I also set the StreamingContext time interval to 10 minutes as I will only have changes every 10 minutes?

In the ideal case the cadence should match. Just to be safe, you can also check the timestamp when new data is received in the foreachRDD() method.
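The foreachRDD overload that also receives the batch Time makes such a check easy (sketch):

lines.foreachRDD { (rdd, time) =>
  if (!rdd.isEmpty()) {
    // compare against the expected 10-minute cadence, or just log it
    println(s"New reference data at batch time ${time.milliseconds} ms")
  }
}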
