
Spark Streaming: HDFS


  1. I can't get my Spark job to stream "old" files from HDFS.

If my Spark job is down for some reason (e.g. demo, deployment) while the writing/moving to the HDFS directory continues, I might skip those files once I bring the Spark Streaming job back up.

    val hdfsDStream = ssc.textFileStream("hdfs://sandbox.hortonworks.com/user/root/logs")

    hdfsDStream.foreachRDD(
      rdd => logInfo("Number of records in this batch: " + rdd.count())
    )

Output --> Number of records in this batch: 0

  1. Is there a way for Spark Streaming to move the "read" files to a different folder? Or do we have to program it manually, so that it avoids reading already-"read" files? (A sketch of the manual approach follows after this list.)

  2. Is Spark Streaming the same as running a Spark job (sc.textFile) from cron?
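
Spark Streaming itself does not move or delete files it has read, so this has to be programmed manually. A sketch of one manual approach, assuming a hypothetical archive directory: after each batch, rename everything in the watched directory into the archive. Note the race condition: a file that lands between the batch read and the rename could be archived unread.

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    hdfsDStream.foreachRDD { rdd =>
      // Mirrors the logging from the snippet above.
      logInfo("Number of records in this batch: " + rdd.count())

      // Both paths are assumptions for illustration.
      val watched = new Path("hdfs://sandbox.hortonworks.com/user/root/logs")
      val archive = new Path("hdfs://sandbox.hortonworks.com/user/root/logs-archive")
      val fs = FileSystem.get(watched.toUri, new Configuration())

      // Move every file currently in the watched directory to the archive.
      fs.listStatus(watched).foreach { status =>
        fs.rename(status.getPath, new Path(archive, status.getPath.getName))
      }
    }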

As Dean mentioned, textFileStream defaults to processing only new files.

  def textFileStream(directory: String): DStream[String] = {
    fileStream[LongWritable, Text, TextInputFormat](directory).map(_._2.toString)
  }

So, all it is doing is calling this variant of fileStream:

def fileStream[
    K: ClassTag,
    V: ClassTag,
    F <: NewInputFormat[K, V]: ClassTag
  ] (directory: String): InputDStream[(K, V)] = {
    new FileInputDStream[K, V, F](this, directory)
  }

And, looking at the FileInputDStream class, we see that it can indeed look for existing files, but defaults to new files only:

newFilesOnly: Boolean = true,

So, going back into the StreamingContext code, we can see that there is an overload we can use by calling the fileStream method directly:

def fileStream[
    K: ClassTag,
    V: ClassTag,
    F <: NewInputFormat[K, V]: ClassTag
  ](directory: String, filter: Path => Boolean, newFilesOnly: Boolean): InputDStream[(K, V)] = {
    new FileInputDStream[K, V, F](this, directory, filter, newFilesOnly)
  }

So, the TL;DR is:

ssc.fileStream[LongWritable, Text, TextInputFormat]
    (directory, FileInputDStream.defaultFilter, false).map(_._2.toString)
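
For completeness, here is a self-contained sketch of that call. In the Spark sources, FileInputDStream.defaultFilter is package-private and may not be accessible from user code, so an equivalent inline filter (skip hidden files) stands in for it here; the directory is the one from the question.

    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

    // Stand-in for FileInputDStream.defaultFilter: ignore hidden files.
    val notHidden = (path: Path) => !path.getName.startsWith(".")

    val lines = ssc.fileStream[LongWritable, Text, TextInputFormat](
        "hdfs://sandbox.hortonworks.com/user/root/logs", notHidden, false
      ).map(_._2.toString)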

Are you expecting Spark to read files already in the directory? If so, this is a common misconception, one that took me by surprise. textFileStream watches a directory for new files to appear, then reads them. It ignores files that are already in the directory when you start, as well as files it has already read.

The rationale is that you'll have some process writing files to HDFS, which you then want Spark to read. Note that these files must appear atomically, e.g., they were slowly written somewhere else, then moved to the watched directory. This is because HDFS doesn't properly handle reading and writing a file simultaneously.
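
A minimal sketch of that write-then-move pattern with the Hadoop FileSystem API, assuming hypothetical staging and watched paths:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    val staging = new Path("hdfs://sandbox.hortonworks.com/user/root/staging/log_001.txt")
    val watched = new Path("hdfs://sandbox.hortonworks.com/user/root/logs/log_001.txt")
    val fs = FileSystem.get(watched.toUri, new Configuration())

    // Write the file outside the watched directory first...
    val out = fs.create(staging)
    out.writeBytes("some log line\n")
    out.close()

    // ...then rename it into place. On HDFS a rename is atomic, so the
    // stream never observes a half-written file.
    fs.rename(staging, watched)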

import org.apache.hadoop.fs.Path
import scala.collection.mutable.ListBuffer

// Side effect: record the paths that were accepted.
val list = ListBuffer[String]()

val filterF = (x: Path) => {
  println("looking if " + x + " is to be considered or not")
  // Accept the file if the timestamp suffix of its name (after the
  // last '_') is earlier than the current time.
  val flag = x.getName.split("_").last.toLong < System.currentTimeMillis
  if (flag) { println("considered " + x); list += x.toString }
  flag
}

This filter function is used to decide whether each path is actually one you want, so its body should be customized to your requirements.

val streamed_rdd = ssc.fileStream[LongWritable, Text, TextInputFormat](
    "/user/hdpprod/temp/spark_streaming_output", filterF, false
  ).map { case (_, text) => text.toString }

Now set the third argument of fileStream to false; this makes sure the stream considers not only new files but also old files already present in the streaming directory.
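
For instance, a filter keyed on file extension rather than a timestamp suffix, a sketch in which the ".log" extension is an assumption:

    // Accept only files whose names end in ".log".
    val logOnly = (path: Path) => path.getName.endsWith(".log")

    val logLines = ssc.fileStream[LongWritable, Text, TextInputFormat](
        "/user/hdpprod/temp/spark_streaming_output", logOnly, false
      ).map { case (_, text) => text.toString }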
