
Spark Streaming Empty RDD Issue

I am trying to create a custom stream receiver from an RDBMS.

val dataDStream = ssc.receiverStream(new inputReceiver())
dataDStream.foreachRDD((rdd: RDD[String], time: Time) => {
  val newdata = rdd.flatMap(x => x.split(","))
  newdata.foreach(println)  // *******This line has the problem: newdata has no records
})

ssc.start()
ssc.awaitTermination()
}

class inputReceiver extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) with Logging {
  def onStart() {
    // Start the thread that receives data over a connection
    new Thread("RDBMS data Receiver") {
      override def run() {
        receive()
      }
    }.start()
  }
  def onStop() {
  }

  def receive() {
    val sqlcontext = SQLContextSingleton.getInstance()

    // **** I am assuming something is wrong in the following code
    val DF = sqlcontext.read.json("/home/cloudera/data/s.json")
    for (data <- rdd) {
      store(data.toString())
    }
    logInfo("Stopped receiving")
    restart("Trying to connect again")
  }
}

The code executes without error, but does not print any records from the dataframe.

I am using Spark 1.6 and Scala.

To make your code work you should change the following:

def receive() {
  val sqlcontext = SQLContextSingleton.getInstance()
  val DF = sqlcontext.read.json("/home/cloudera/data/s.json")

  // **** this: iterate over the DataFrame's rows, not an undefined `rdd`
  DF.rdd.collect.foreach(data => store(data.toString()))

  logInfo("Stopped receiving")
  restart("Trying to connect again")
}

HOWEVER, this is not advisable: collecting means all of the data in your JSON file is pulled back to and processed by the driver, and the receiver gives no proper consideration to reliability.
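For illustration, a receiver can at least hand records to Spark in blocks via the `store(Iterator)` overload and restart cleanly on failure. This is only a sketch under the same Spark 1.6 API as your code (`FileLineReceiver` and its path parameter are hypothetical names); it still reads the whole file on a single receiver thread:

```scala
import scala.io.Source

import org.apache.spark.Logging
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// Hypothetical receiver that streams lines of a file into Spark
class FileLineReceiver(path: String)
    extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) with Logging {

  def onStart(): Unit = {
    // Receive on a separate thread so onStart returns immediately
    new Thread("File Receiver") {
      override def run(): Unit = receive()
    }.start()
  }

  def onStop(): Unit = {}  // nothing to clean up in this sketch

  private def receive(): Unit = {
    try {
      val lines = Source.fromFile(path).getLines()
      // store(Iterator) pushes records to Spark in blocks, not one at a time
      store(lines)
      stop("All records stored")
    } catch {
      case e: Exception => restart("Error reading file", e)
    }
  }
}
```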

I suspect that Spark Streaming isn't the right fit for your use case. Reading between the lines, either you are genuinely streaming, in which case you need a proper producer, or you are reading data dumped from the RDBMS into JSON, in which case you don't need Spark Streaming at all.
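If the JSON file is a one-off dump from the RDBMS, a plain batch job covers it with no receiver or streaming context at all. A minimal sketch, assuming the same file path and that each record should be split on commas as in your `foreachRDD`:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object BatchRead {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("BatchRead")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    // Read the dump once as a batch DataFrame -- no receiver, no ssc
    val df = sqlContext.read.json("/home/cloudera/data/s.json")

    // Same transformation the streaming code applied, run on the executors
    df.toJSON.flatMap(_.split(",")).collect().foreach(println)

    sc.stop()
  }
}
```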

Disclaimer: the technical posts on this site are licensed under CC BY-SA 4.0; if you repost, please credit this site or the original source. For questions, contact yoyou2525@163.com.
