Spark FileStreaming not Working with foreachRDD

I'm new to Spark, and I'm building a small sample application that uses Spark file streaming. All I want is to read the whole file in one go instead of reading it line by line (which I guess is what textFileStream does).

The code is below:

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._

import org.apache.hadoop.io.LongWritable
import org.apache.hadoop.io.Text
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

import scalax.io._

object SampleXML{

    def main(args: Array[String]){

        val logFile = "/home/akhld/mobi/spark-streaming/logs/sample.xml"

        val ssc = new StreamingContext("spark://localhost:7077","XML Streaming Job",Seconds(5),"/home/akhld/mobi/spark-streaming/spark-0.8.0-incubating",List("target/scala-2.9.3/simple-project_2.9.3-1.0.jar"))

        val lines = ssc.fileStream[LongWritable, Text, TextInputFormat]("/home/akhld/mobi/spark-streaming/logs/")

        lines.print()

        lines.foreachRDD(rdd => {
          rdd.count()  // counts the records in each batch (the result is computed but never printed)
        })

        ssc.start()
    }
}

This code fails to compile with the following error:

[error] /home/akhld/mobi/spark-streaming/samples/samplexml/src/main/scala/SampleXML.scala:31: value foreachRDD is not a member of org.apache.spark.streaming.DStream[(org.apache.hadoop.io.LongWritable, org.apache.hadoop.io.Text)]
[error]         ssc.fileStream[LongWritable, Text, TextInputFormat]("/home/akhld/mobi/spark-streaming/logs/").foreachRDD(rdd => {
[error]                                                                                                       ^
[error] one error found
[error] (compile:compile) Compilation failed
[error] Total time: 3 s, completed Feb 3, 2014 7:32:57 PM

If this is not the right way of displaying the contents of the files in the stream, please help me with an example. I searched a lot but couldn't find a proper example of how to use fileStream.

Well, textFileStream in Spark Streaming is meant for continuously reading and processing files as they are written into a directory. So if you have to process one file as a whole in one shot, it is simpler to use Spark directly!

 val lines = sparkContext.textFile(<file URL>)
 lines.foreach(line => println(line))

This will print all the lines in the file.
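Since the question's goal was to read each file in one go rather than line by line, here is a minimal sketch of that, assuming a later Spark release that provides SparkContext.wholeTextFiles (it does not exist in 0.8.0-incubating):

 // wholeTextFiles yields (filePath, fileContent) pairs, one per file,
 // so each file arrives as a single string instead of line by line
 val files = sparkContext.wholeTextFiles("/home/akhld/mobi/spark-streaming/logs/")
 files.foreach { case (path, content) =>
   println("File: " + path)
   println(content)
 }

As with the textFile example above, println runs wherever the tasks execute, so the output only lands on the console when running in local mode.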

Also, I believe the issue here is that you cannot call count() on an RDD inside a foreach block on a stream. The reason is that, if you do, I think it blocks progress of the foreach block, and the stream consumer stops working.

I've created a JIRA for this: https://issues.apache.org/jira/browse/SPARK-4040

I think there are some API calls you can safely make on RDDs when you are referencing them in a foreach block, but I haven't quite worked out all the details yet.
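For what it's worth, if a per-batch record count is all that is needed, DStream itself has a count() transformation that produces a new DStream of counts, so no RDD action has to run inside a foreach block. A minimal sketch, reusing the fileStream from the question:

 val lines = ssc.fileStream[LongWritable, Text, TextInputFormat]("/home/akhld/mobi/spark-streaming/logs/")
 lines.count().print()  // prints the number of records in each batch on the driver
 ssc.start()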
