
Spark Streaming only streams files created after the stream initialization time

Is there any way to configure the textFileStream source such that it will process any file added to the source directory regardless of the file create time?

To demonstrate the issue, I created a basic Spark Streaming application that uses textFileStream as a source and prints the stream contents to the console. When an existing file created prior to running the application is copied into the source directory, nothing is printed to the console. When a file created after the application starts running is copied to the source directory, the file contents are printed. Below is my code for reference.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("Streaming Test")
                          .setMaster("local[*]")

val spark = new SparkContext(conf)
val ssc = new StreamingContext(spark, Seconds(5))

// Monitor the source directory for newly created files
val fileStream = ssc.textFileStream("/stream-source")

// Split each line of each new file into words
val streamContents = fileStream.flatMap(_.split(" "))

streamContents.print()

ssc.start()
ssc.awaitTermination()

This is the documented behavior of the FileInputDStream.

If we would like to consume existing files in that directory, we can use the Spark API to load these files and apply our desired logic to them.

val existingFiles = sparkContext.textFile(path)

or

val existingFilesDS = sparkSession.read.text(path)

Afterwards, set up and start the streaming logic. We could even use the data of the already existing files when processing the new ones.
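As a minimal sketch of that last idea (one possible approach, not the only one): load the pre-existing files as a batch RDD, then fold their data into each streaming batch with DStream.transform. The directory path and the word-count logic below are just illustrative, reusing the /stream-source directory from the question.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("Streaming Test").setMaster("local[*]")
val sc   = new SparkContext(conf)
val ssc  = new StreamingContext(sc, Seconds(5))

// Files that already existed before the stream started, loaded via the batch API
val existingFiles = sc.textFile("/stream-source")

// Word counts from the pre-existing data, computed once up front
val existingCounts = existingFiles.flatMap(_.split(" ")).map((_, 1L)).reduceByKey(_ + _)

// Files created after stream initialization
val newFiles = ssc.textFileStream("/stream-source")

// Merge the existing counts into each new micro-batch
val combined = newFiles
  .flatMap(_.split(" "))
  .map((_, 1L))
  .transform(batch => batch.union(existingCounts).reduceByKey(_ + _))

combined.print()

ssc.start()
ssc.awaitTermination()
```

Note that existingCounts is computed from a one-time batch read, so it reflects only the files present when the application started; anything added later arrives through the stream.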
