Spark FileStreaming not Working with foreachRDD

I'm new to Spark, and I'm building a small sample application that uses Spark file streaming. All I want is to read the whole file in one go instead of reading it line by line (which I guess is what textFileStream does).

The code is below:

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._

import org.apache.hadoop.io.LongWritable
import org.apache.hadoop.io.Text
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

import scalax.io._

object SampleXML{

    def main(args: Array[String]){

        val logFile = "/home/akhld/mobi/spark-streaming/logs/sample.xml"

        val ssc = new StreamingContext("spark://localhost:7077","XML Streaming Job",Seconds(5),"/home/akhld/mobi/spark-streaming/spark-0.8.0-incubating",List("target/scala-2.9.3/simple-project_2.9.3-1.0.jar"))

        val lines = ssc.fileStream[LongWritable, Text, TextInputFormat]("/home/akhld/mobi/spark-streaming/logs/")

        lines.print()

        lines.foreachRDD(rdd => {
          rdd.count()  // counts the records in each batch (the result is computed but never printed)
        })

        ssc.start()
    }
}

This code fails to compile with the following error:

[error] /home/akhld/mobi/spark-streaming/samples/samplexml/src/main/scala/SampleXML.scala:31: value foreachRDD is not a member of org.apache.spark.streaming.DStream[(org.apache.hadoop.io.LongWritable, org.apache.hadoop.io.Text)]
[error]         ssc.fileStream[LongWritable, Text, TextInputFormat]("/home/akhld/mobi/spark-streaming/logs/").foreachRDD(rdd => {
[error]                                                                                                       ^
[error] one error found
[error] (compile:compile) Compilation failed
[error] Total time: 3 s, completed Feb 3, 2014 7:32:57 PM

If this is not the right way of displaying the contents of the files in the stream, please help me with an example. I searched a lot but couldn't find a proper example of how to use fileStream.

Well, textFileStream in Spark Streaming is meant for continuously reading and processing files as they are written into a directory. So if you have to process one file as a whole in one shot, it is simpler to use Spark directly!

 val lines = sparkContext.textFile(<file URL>)
 lines.foreach(line => println(line))

This will print all the lines in the file.
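Since the question's goal was to read each file in one go rather than line by line, here is a minimal sketch of that, assuming a later Spark release that provides SparkContext.wholeTextFiles (it does not exist in 0.8.0-incubating):

 // wholeTextFiles yields (filePath, fileContent) pairs, one per file,
 // so each file arrives as a single string instead of line by line
 val files = sparkContext.wholeTextFiles("/home/akhld/mobi/spark-streaming/logs/")
 files.foreach { case (path, content) =>
   println("File: " + path)
   println(content)
 }

As with the textFile example above, println runs wherever the tasks execute, so the output only lands on the console when running in local mode.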

Also, I believe the issue here is that you cannot call count() on an RDD inside a foreach block on a stream. The reason is that, if you do, I think it blocks progress of the foreach block, and the stream consumer stops working.

I've created a JIRA for this: https://issues.apache.org/jira/browse/SPARK-4040

I think there are some API calls you can safely make on RDDs when you are referencing them in a foreach block, but I haven't quite worked out all the details yet.
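For what it's worth, if a per-batch record count is all that is needed, DStream itself has a count() transformation that produces a new DStream of counts, so no RDD action has to run inside a foreach block. A minimal sketch, reusing the fileStream from the question:

 val lines = ssc.fileStream[LongWritable, Text, TextInputFormat]("/home/akhld/mobi/spark-streaming/logs/")
 lines.count().print()  // prints the number of records in each batch on the driver
 ssc.start()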
