
Py4JJavaError in Spark wordcount (Python 3.5) on Jupyter notebook

I am trying a simple network word count program with Spark Streaming in Python, using the following code:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# Local Spark context with two worker threads and a 1-second batch interval
sc = SparkContext("local[2]", "NetworkWordCount")
ssc = StreamingContext(sc, 1)

# Read lines of text from a TCP socket on localhost:9999
lines = ssc.socketTextStream("localhost", 9999)

# Split each line into words, pair each word with 1, and sum the counts per word
words = lines.flatMap(lambda line: line.split(" "))
pairs = words.map(lambda word: (word, 1))
wordCount = pairs.reduceByKey(lambda x, y: x + y)

wordCount.pprint()
ssc.start()
ssc.awaitTermination()
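
(For reference, a socket source like this is usually fed from a separate terminal with netcat, e.g. nc -lk 9999, as in the Spark Streaming network word count example.)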

It runs fine up to ssc.start(), but ssc.awaitTermination() raises the following error:

Py4JJavaError                             Traceback (most recent call last)
<ipython-input-2-18f3db416f1c> in <module>()
      1 ssc.start()
----> 2 ssc.awaitTermination()

/usr/local/lib/python3.5/dist-packages/pyspark/streaming/context.py in awaitTermination(self, timeout)
    204         """
    205         if timeout is None:
--> 206             self._jssc.awaitTermination()
    207         else:
    208             self._jssc.awaitTerminationOrTimeout(int(timeout * 1000))

/usr/local/lib/python3.5/dist-packages/py4j/java_gateway.py in __call__(self, *args)
   1158         answer = self.gateway_client.send_command(command)
   1159         return_value = get_return_value(
-> 1160             answer, self.gateway_client, self.target_id, self.name)
   1161 
   1162         for temp_arg in temp_args:

/usr/local/lib/python3.5/dist-packages/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    318                 raise Py4JJavaError(
    319                     "An error occurred while calling {0}{1}{2}.\n".
--> 320                     format(target_id, ".", name), value)
    321             else:
    322                 raise Py4JError(

Py4JJavaError: An error occurred while calling o22.awaitTermination.
: org.apache.spark.SparkException: An exception was raised by Python:
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/pyspark/streaming/util.py", line 65, in call
    r = self.func(t, *rdds)
  File "/usr/local/lib/python3.5/dist-packages/pyspark/streaming/dstream.py", line 171, in takeAndPrint
    taken = rdd.take(num + 1)
  File "/usr/local/lib/python3.5/dist-packages/pyspark/rdd.py", line 1358, in take
    res = self.context.runJob(self, takeUpToNumLeft, p)
  File "/usr/local/lib/python3.5/dist-packages/pyspark/context.py", line 1001, in runJob
    port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, partitions)
  File "/usr/local/lib/python3.5/dist-packages/py4j/java_gateway.py", line 1160, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/usr/local/lib/python3.5/dist-packages/py4j/protocol.py", line 320, in get_return_value
    format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 1 times, most recent failure: Lost task 0.0 in stage 2.0 (TID 1, localhost, executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/home/blaze/spark/spark-2.2.1-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 123, in main
    ("%d.%d" % sys.version_info[:2], version))
Exception: Python in worker has different version 2.7 than that in driver 3.5, PySpark cannot run with different minor versions.Please check environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly set.

    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
    at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
    at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:108)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1517)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1505)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1504)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1504)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
    at scala.Option.foreach(Option.scala:257)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1732)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1687)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1676)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2029)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2050)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2069)
    at org.apache.spark.api.python.PythonRDD$.runJob(PythonRDD.scala:455)
    at org.apache.spark.api.python.PythonRDD.runJob(PythonRDD.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:280)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:214)
    at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/home/blaze/spark/spark-2.2.1-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 123, in main
    ("%d.%d" % sys.version_info[:2], version))
Exception: Python in worker has different version 2.7 than that in driver 3.5, PySpark cannot run with different minor versions.Please check environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly set.

    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
    at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
    at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:108)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    ... 1 more


    at org.apache.spark.streaming.api.python.TransformFunction.callPythonTransformFunction(PythonDStream.scala:95)
    at org.apache.spark.streaming.api.python.TransformFunction.apply(PythonDStream.scala:78)
    at org.apache.spark.streaming.api.python.PythonDStream$$anonfun$callForeachRDD$1.apply(PythonDStream.scala:179)
    at org.apache.spark.streaming.api.python.PythonDStream$$anonfun$callForeachRDD$1.apply(PythonDStream.scala:179)
    at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ForEachDStream.scala:51)
    at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:51)
    at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:51)
    at org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:416)
    at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:50)
    at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:50)
    at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:50)
    at scala.util.Try$.apply(Try.scala:192)
    at org.apache.spark.streaming.scheduler.Job.run(Job.scala:39)
    at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply$mcV$sp(JobScheduler.scala:257)
    at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:257)
    at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:257)
    at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
    at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:256)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

Spark version: 2.2.1, Python version: 3.5, Java version: 1.8.0_162, PySpark version: 2.3.0

Thanks.

Are you using standalone Spark?

Your error is: Exception: Python in worker has different version 2.7 than that in driver 3.5, PySpark cannot run with different minor versions. Please check environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly set.

Your error has been addressed here: How do I set the driver's python version in spark?

UPDATE THE SPARK ENVIRONMENT TO USE PYTHON 3.7:

Open a new terminal and type the following command: export PYSPARK_PYTHON=python3.7. This ensures that the worker nodes use Python 3.7 (the same as the driver) rather than the default Python 3.4.
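
If you are launching PySpark from a Jupyter notebook, as in the question, you can achieve the same thing by pointing both variables at the notebook's own interpreter before the SparkContext is created. A minimal sketch, assuming the notebook kernel is the Python 3.5 you also want the workers to use:

import os
import sys

# Point both the workers and the driver at the notebook's interpreter
# before any SparkContext exists (assumption: local[2] mode as in the question).
os.environ["PYSPARK_PYTHON"] = sys.executable
os.environ["PYSPARK_DRIVER_PYTHON"] = sys.executable

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "NetworkWordCount")
ssc = StreamingContext(sc, 1)

In local mode this is usually enough; on a real cluster the same variables also have to be visible on the worker nodes, for example via conf/spark-env.sh.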

DEPENDING ON THE VERSIONS OF PYTHON YOU HAVE, YOU MAY HAVE TO INSTALL/UPDATE ANACONDA:

(To install, see: https://www.digitalocean.com/community/tutorials/how-to-install-anaconda-on-ubuntu-18-04-quickstart)

Make sure you have Anaconda 4.1.0 or higher. Open a new terminal and check your conda version by typing:

conda --version


If you are below Anaconda 4.1.0, type conda update conda.

  1. Next, check whether you have the nb_conda_kernels library by typing

conda list


  2. If you don't see nb_conda_kernels, type

conda install nb_conda_kernels


  3. If you are using Python 2 and want a separate Python 3 environment, type the following

conda create -n py36 python=3.6 ipykernel

py36 is the name of the environment; you could name it anything you want.

Alternatively, if you are using Python 3 and want a separate Python 2 environment, you could type the following.

conda create -n py27 python=2.7 ipykernel

py27 is the name of the environment. It uses Python 2.7.

  4. Ensure the Python versions are installed successfully and close the terminal. Open a new terminal and type pyspark. You should see the new environments appearing. (A quick way to confirm that the driver and workers now agree is sketched below.)
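
Once the driver and the workers point at the same interpreter, a quick sanity check from the notebook is to compare the driver's Python version with the one an executor reports. A small sketch, assuming a running SparkContext named sc as in the question:

import sys

driver_version = sys.version_info[:2]
# Run a trivial one-partition job so a worker reports its own interpreter version
worker_version = sc.parallelize([0], 1).map(
    lambda _: __import__("sys").version_info[:2]).first()

print(driver_version, worker_version)  # both should read (3, 5) after the fix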
