
Spark Streaming Exception: java.util.NoSuchElementException: None.get

I am writing Spark Streaming data to HDFS by converting it to a DataFrame:

Code:

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SaveMode
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

object KafkaSparkHdfs {

  val sparkConf = new SparkConf().setMaster("local[*]").setAppName("SparkKafka")
  sparkConf.set("spark.driver.allowMultipleContexts", "true");
  val sc = new SparkContext(sparkConf)

  def main(args: Array[String]): Unit = {
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext.implicits._

    val ssc = new StreamingContext(sparkConf, Seconds(20))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "localhost:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "stream3",
      "auto.offset.reset" -> "latest",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )

    val topics = Array("fridaydata")
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](topics, kafkaParams)
    )

    val lines = stream.map(consumerRecord => consumerRecord.value)
    val words = lines.flatMap(_.split(" "))
    val wordMap = words.map(word => (word, 1))
    val wordCount = wordMap.reduceByKey(_ + _)

    wordCount.foreachRDD(rdd => {
      val dataframe = rdd.toDF(); 
      dataframe.write
        .mode(SaveMode.Append)
        .save("hdfs://localhost:9000/newfile24")     
    })

    ssc.start()
    ssc.awaitTermination()
  }
}

The folder is created, but the file is not written.

The program terminates with the following error:

    18/06/22 16:14:41 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
    java.util.NoSuchElementException: None.get
    at scala.None$.get(Option.scala:347)
    at scala.None$.get(Option.scala:345)
    at org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:343)
    at org.apache.spark.storage.BlockManager.releaseAllLocksForTask(BlockManager.scala:670)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:289)
    at java.lang.Thread.run(Thread.java:748)
    18/06/22 16:14:41 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): java.util.NoSuchElementException: None.get
    at scala.None$.get(Option.scala:347)
    at scala.None$.get(Option.scala:345)
    at org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:343)
    at org.apache.spark.storage.BlockManager.releaseAllLocksForTask(BlockManager.scala:670)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:289)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)

In my pom I am using the respective dependencies:

  • spark-core_2.11
  • spark-sql_2.11
  • spark-streaming_2.11
  • spark-streaming-kafka-0-10_2.11

The error is caused by trying to run multiple Spark contexts at the same time. Setting allowMultipleContexts to true is mostly meant for testing purposes and its use is discouraged. The solution to your problem is therefore to make sure that the same SparkContext is used everywhere. In the code we can see that the SparkContext (sc) is used to create a SQLContext, which is fine. However, when creating the StreamingContext it is not used; the SparkConf is used instead.

By looking at the documentation we see:

Create a StreamingContext by providing the configuration necessary for a new SparkContext.

In other words, when a SparkConf is passed as the parameter, a new SparkContext is created. Now there are two separate contexts.

The easiest solution here would be to continue using the same context as before. Change the line creating the StreamingContext to:

val ssc = new StreamingContext(sc, Seconds(20))
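
With that change, the allowMultipleContexts workaround is no longer needed either. A minimal sketch of the corrected setup, assuming the rest of the program stays exactly as in the question:

val sparkConf = new SparkConf().setMaster("local[*]").setAppName("SparkKafka")
val sc = new SparkContext(sparkConf) // the only SparkContext in the application

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._

// Build the StreamingContext from the existing SparkContext rather than from the SparkConf,
// so no second context is created.
val ssc = new StreamingContext(sc, Seconds(20))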

Note: In newer versions of Spark (2.0+), use SparkSession instead. A new streaming context can then be created using StreamingContext(spark.sparkContext, ...). It can look as follows:

val spark = SparkSession.builder
  .master("local[*]")
  .appName("SparkKafka")
  .getOrCreate()

import spark.implicits._
val ssc = new StreamingContext(spark.sparkContext, Seconds(20))
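
Inside foreachRDD, the DataFrame conversion and write then work as before. A short sketch, assuming the implicits from the SparkSession above and illustrative column names; save() with no explicit format falls back to Spark's default data source (Parquet unless spark.sql.sources.default is changed):

wordCount.foreachRDD { rdd =>
  import spark.implicits._ // enables rdd.toDF for RDD[(String, Int)]
  val dataframe = rdd.toDF("word", "count") // column names are illustrative
  dataframe.write
    .mode(SaveMode.Append)
    .save("hdfs://localhost:9000/newfile24")
}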

There is an obvious problem here: coalesce(1).

dataframe.coalesce(1)

While reducing the number of files might be tempting in many scenarios, it should be done if and only if the amount of data is low enough for the nodes to handle (which is clearly not the case here).

Furthermore, let me quote the documentation:

However, if you're doing a drastic coalesce, e.g. to numPartitions = 1, this may result in your computation taking place on fewer nodes than you like (e.g. one node in the case of numPartitions = 1). To avoid this, you can call repartition. This will add a shuffle step, but means the current upstream partitions will be executed in parallel (per whatever the current partitioning is).

The conclusion is that you should adjust the parameter to the expected amount of data and the desired parallelism. coalesce(1) as such is rarely useful in practice, especially in a context like streaming, where data properties can change over time.
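
As an illustration, both options can be applied right before the write; the partition count of 4 below is an arbitrary placeholder that would have to be tuned to the actual data volume and cluster size:

// coalesce: no shuffle, but a drastic reduction concentrates the work on few nodes
dataframe.coalesce(4)
  .write.mode(SaveMode.Append)
  .save("hdfs://localhost:9000/newfile24")

// repartition: adds a shuffle, but keeps the upstream computation parallel
// before redistributing the data into the requested number of partitions
dataframe.repartition(4)
  .write.mode(SaveMode.Append)
  .save("hdfs://localhost:9000/newfile24")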
