
PySpark RuntimeError: Set changed size during iteration

I am running a pyspark script and hit the error below. It seems to be complaining about "RuntimeError: Set changed size during iteration" because of my line "if len(rdd.take(1)) > 0:". I am not sure whether that is the real cause and would like to know what exactly went wrong. Any help would be appreciated.

Thanks!

 17/03/23 21:54:17 INFO DStreamGraph: Updated checkpoint data for time 1490320070000 ms 17/03/23 21:54:17 INFO JobScheduler: Finished job streaming job 1490320072000 ms.0 from job set of time 1490320072000 ms 17/03/23 21:54:17 INFO JobScheduler: Starting job streaming job 1490320072000 ms.1 from job set of time 1490320072000 ms 17/03/23 21:54:17 ERROR JobScheduler: Error running job streaming job 1490320072000 ms.0 org.apache.spark.SparkException: An exception was raised by Python: Traceback (most recent call last): File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/streaming/util.py", 

line 65, in call
    r = self.func(t, *rdds)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/streaming/dstream.py", line 159, in <lambda>
    func = lambda t, rdd: old_func(rdd)
  File "/home/richard/Documents/spark_code/with_kafka/./mongo_kafka_spark_script.py", line 96, in _compute_glb_max
    if len(rdd.take(1)) > 0:
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1343, in take
    res = self.context.runJob(self, takeUpToNumLeft, p)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/context.py", line 965, in runJob
    port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, partitions)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2439, in _jrdd
    self._jrdd_deserializer, profiler)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2372, in _wrap_function
    pickled_command, broadcast_vars, env, includes = _prepare_for_python_RDD(sc, command)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2363, in _prepare_for_python_RDD
    broadcast_vars = [x._jbroadcast for x in sc._pickled_broadcast_vars]
RuntimeError: Set changed size during iteration

  at org.apache.spark.streaming.api.python.TransformFunction.callPythonTransformFunction(PythonDStream.scala:95) at org.apache.spark.streaming.api.python.TransformFunction.apply(PythonDStream.scala:78) at org.apache.spark.streaming.api.python.PythonDStream$$anonfun$callForeachRDD$1.apply(PythonDStream.scala:179) at org.apache.spark.streaming.api.python.PythonDStream$$anonfun$callForeachRDD$1.apply(PythonDStream.scala:179) at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ForEachDStream.scala:51) at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:51) at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:51) at org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:415) at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:50) at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:50) at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:50) at scala.util.Try$.apply(Try.scala:192) at org.apache.spark.streaming.scheduler.Job.run(Job.scala:39) at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply$mcV$sp(JobScheduler.scala:254) at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:254) at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:254) at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58) at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:253) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Traceback (most recent call last): File "/home/richard/Documents/spark_code/with_kafka/./mongo_kafka_spark_script.py", 

line 224
    ssc.awaitTermination()
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/streaming/context.py", line 206, in awaitTermination
  File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
  File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o38.awaitTermination.
: org.apache.spark.SparkException: An exception was raised by Python: Traceback (most recent call last):
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/streaming/util.py", line 65, in call
    r = self.func(t, *rdds)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/streaming/dstream.py", line 159, in <lambda>
    func = lambda t, rdd: old_func(rdd)
  File "/home/richard/Documents/spark_code/with_kafka/./mongo_kafka_spark_script.py", line 96, in _compute_glb_max
    if len(rdd.take(1)) > 0:
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1343, in take
    res = self.context.runJob(self, takeUpToNumLeft, p)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/context.py", line 965, in runJob
    port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, partitions)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2439, in _jrdd
    self._jrdd_deserializer, profiler)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2372, in _wrap_function
    pickled_command, broadcast_vars, env, includes = _prepare_for_python_RDD(sc, command)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2363, in _prepare_for_python_RDD
    broadcast_vars = [x._jbroadcast for x in sc._pickled_broadcast_vars]
RuntimeError: Set changed size during iteration

  at org.apache.spark.streaming.api.python.TransformFunction.callPythonTransformFunction(PythonDStream.scala:95) at org.apache.spark.streaming.api.python.TransformFunction.apply(PythonDStream.scala:78) at org.apache.spark.streaming.api.python.PythonDStream$$anonfun$callForeachRDD$1.apply(PythonDStream.scala:179) at org.apache.spark.streaming.api.python.PythonDStream$$anonfun$callForeachRDD$1.apply(PythonDStream.scala:179) at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ForEachDStream.scala:51) at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:51) at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:51) at org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:415) at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:50) at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:50) at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:50) at scala.util.Try$.apply(Try.scala:192) at org.apache.spark.streaming.scheduler.Job.run(Job.scala:39) at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply$mcV$sp(JobScheduler.scala:254) at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:254) at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:254) at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58) at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:253) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) 

Creating broadcast variables inside the iteration does not seem to be a best practice. If you need stateful data across batches, use updateStateByKey wherever possible.
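For illustration, here is a minimal sketch of keeping a running per-key maximum with updateStateByKey instead of re-creating broadcast variables every batch. The socket source, the "key value" line layout, and the checkpoint directory are assumptions for the example, not taken from the original script:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="stateful_max_sketch")
ssc = StreamingContext(sc, 2)            # 2-second micro-batches
ssc.checkpoint("/tmp/spark_checkpoint")  # updateStateByKey requires checkpointing

def update_max(new_values, current_max):
    # new_values: values seen for this key in the current batch
    # current_max: state carried over from previous batches (None at first)
    candidates = new_values + ([current_max] if current_max is not None else [])
    return max(candidates) if candidates else current_max

# Assume a DStream of "key value" text lines (e.g. from Kafka or a socket)
lines = ssc.socketTextStream("localhost", 9999)
pairs = lines.map(lambda line: (line.split()[0], float(line.split()[1])))
running_max = pairs.updateStateByKey(update_max)
running_max.pprint()

ssc.start()
ssc.awaitTermination()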

Try

if rdd.count() < 1:

take() can throw exceptions; however, if more details were available, we could pinpoint the error.
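As a rough sketch of what that check might look like inside a foreachRDD callback (the callback name process_batch and the print are illustrative assumptions, not part of the original script):

def process_batch(time, rdd):
    # Skip empty micro-batches; count() avoids the take(1) path
    # that appeared in the traceback above.
    n = rdd.count()
    if n < 1:
        return
    print("batch at %s has %d records" % (time, n))
    # ... process the non-empty batch here ...

# Assuming dstream is the DStream from the original script:
# dstream.foreachRDD(process_batch)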

