
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob. : org.apache.spark.SparkException

from pyspark import SparkConf, SparkContext
conf = SparkConf().setMaster("local").setAppName("my App")
sc = SparkContext(conf=conf)
lines = sc.textFile("C:/Users/user/Downloads/learning-spark-master/learning-spark-master/README.md")
pythonLines = lines.filter(lambda line: "Python" in line)
pythonLines  # transformations are lazy: this only displays the RDD, no job runs
pythonLines.first()  # first() is an action and triggers the failing job

I am new to PySpark. I was trying to execute the above code, and I get the following error after executing pythonLines.first(). Any help would be appreciated.

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 3.0 failed 1 times, most recent failure: Lost task 0.0 in stage 3.0 (TID 3) (LAPTOP-GAN836TE.fios-router.home executor driver): org.apache.spark.SparkException: Python worker failed to connect back.
    at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:182)
    at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:107)
    at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:119)
    at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:145)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:131)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.net.SocketTimeoutException: Accept timed out
    at java.net.DualStackPlainSocketImpl.waitForNewConnection(Native Method)
    at java.net.DualStackPlainSocketImpl.socketAccept(DualStackPlainSocketImpl.java:131)
    at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:535)
    at java.net.PlainSocketImpl.accept(PlainSocketImpl.java:189)
    at java.net.ServerSocket.implAccept(ServerSocket.java:545)
    at java.net.ServerSocket.accept(ServerSocket.java:513)
    at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:174)
    ... 14 more

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2253)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2202)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2201)
    at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
    at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2201)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1078)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1078)
    at scala.Option.foreach(Option.scala:407)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1078)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2440)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2382)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2371)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:868)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2202)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2223)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2242)
    at org.apache.spark.api.python.PythonRDD$.runJob(PythonRDD.scala:166)
    at org.apache.spark.api.python.PythonRDD.runJob(PythonRDD.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.SparkException: Python worker failed to connect back.
    at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:182)
    at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:107)
    at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:119)
    at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:145)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:131)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    ... 1 more
Caused by: java.net.SocketTimeoutException: Accept timed out
    at java.net.DualStackPlainSocketImpl.waitForNewConnection(Native Method)
    at java.net.DualStackPlainSocketImpl.socketAccept(DualStackPlainSocketImpl.java:131)
    at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:535)
    at java.net.PlainSocketImpl.accept(PlainSocketImpl.java:189)
    at java.net.ServerSocket.implAccept(ServerSocket.java:545)
    at java.net.ServerSocket.accept(ServerSocket.java:513)
    at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:174)
    ... 14 more

Based on the code, I am not seeing anything wrong. Still, you can analyze this issue with the following checks on the data:

  • Make sure the lines RDD on line 4 actually has data, by checking collect().

  • Make sure that after your filter on line 5 you are not getting an empty RDD, by using isEmpty() (ref: link). Both checks are sketched below.
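
A minimal sketch of both checks, reusing the question's sc and file path (the print calls are only for inspection):

lines = sc.textFile("C:/Users/user/Downloads/learning-spark-master/learning-spark-master/README.md")
print(lines.collect())        # check 1: the source RDD actually holds the file's lines

pythonLines = lines.filter(lambda line: "Python" in line)
print(pythonLines.isEmpty())  # check 2: True means the filter matched nothing

if not pythonLines.isEmpty():
    print(pythonLines.first())  # only materialize the first line when one exists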

I have run the same code as a sample for your reference.

[screenshot: the sample code running successfully]

I ran into the same error in Chapter 7 of the book "Data Science on GCP" by Valliappa Lakshmanan.

The author points this out in one of the logistic_regression.ipynb cells with the note "if this is empty, change the shard you are using", but it's not obvious that the above error can be an indication of exactly that.

Following that tip, simply change


inputs = 'gs://{}/flights/tzcorr/all_flights-00000-*'.format(BUCKET)

to something like the following (note the 1 instead of the 0, which selects a different shard):


inputs = 'gs://{}/flights/tzcorr/all_flights-00001-*'.format(BUCKET)

You'd have to make an equivalent change further down as well, so that you don't test the model on the same data you trained it on.
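
For example, a sketch of that split (the shard numbers are illustrative, not taken from the book; any two non-empty, non-overlapping patterns would do):

# Hypothetical split: train and evaluate on disjoint shards of the dataset
train_inputs = 'gs://{}/flights/tzcorr/all_flights-00001-*'.format(BUCKET)
eval_inputs = 'gs://{}/flights/tzcorr/all_flights-00002-*'.format(BUCKET)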
