I'm trying to run Spark via PySpark with Python 3 on Ubuntu 18.04, but I'm getting a bunch of different errors that I don't know how to handle. I'm using the Java 10 JDK and my JAVA_HOME variable is already set. This is the code I'm trying to run in Python:
import sys
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
sc = SparkContext(appName="PysparkStreaming")
ssc = StreamingContext(sc, 3)  # streaming batch interval of 3 seconds
lines = ssc.textFileStream('/home/mabarberan/Escritorio/prueba spark/')  # directory to monitor
counts = lines.flatMap(lambda line: line.split(" ")) \
.map(lambda x: (x, 1)) \
.reduceByKey(lambda a, b: a + b)
counts.pprint()
ssc.start()
ssc.awaitTermination()
and these are the errors I'm getting:
/home/mabarberan/anaconda3/bin/python /home/mabarberan/Descargas/carpeta.py
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.hadoop.security.authentication.util.KerberosUtil (file:/home/mabarberan/anaconda3/lib/python3.6/site-packages/pyspark/jars/hadoop-auth-2.7.3.jar) to method sun.security.krb5.Config.getInstance()
WARNING: Please consider reporting this to the maintainers of org.apache.hadoop.security.authentication.util.KerberosUtil
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
2018-06-21 12:53:07 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
[Stage 1:> (0 + 1) / 1]2018-06-21 12:53:13 ERROR PythonRunner:91 - Python worker exited unexpectedly (crashed)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/home/mabarberan/anaconda3/lib/python3.6/site-packages/pyspark/python/lib/pyspark.zip/pyspark/worker.py", line 176, in main
("%d.%d" % sys.version_info[:2], version))
Exception: Python in worker has different version 2.7 than that in driver 3.6, PySpark cannot run with different minor versions.Please check environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly set.
I can't paste the full error output here, but this is what comes next:
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1135)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
at java.base/java.lang.Thread.run(Thread.java:844)
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/home/mabarberan/anaconda3/lib/python3.6/site-packages/pyspark/python/lib/pyspark.zip/pyspark/worker.py", line 176, in main
("%d.%d" % sys.version_info[:2], version))
Exception: Python in worker has different version 2.7 than that in driver 3.6, PySpark cannot run with different minor versions.Please check environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly set.
2018-06-21 12:53:13 ERROR TaskSetManager:70 - Task 0 in stage 1.0 failed 1 times; aborting job
2018-06-21 12:53:13 ERROR JobScheduler:91 - Error running job streaming job 1529578392000 ms.0
org.apache.spark.SparkException: An exception was raised by Python:
Traceback (most recent call last):
File "/home/mabarberan/anaconda3/lib/python3.6/site-packages/pyspark/streaming/util.py", line 65, in call
r = self.func(t, *rdds)
File "/home/mabarberan/anaconda3/lib/python3.6/site-packages/pyspark/streaming/dstream.py", line 171, in takeAndPrint
taken = rdd.take(num + 1)
File "/home/mabarberan/anaconda3/lib/python3.6/site-packages/pyspark/rdd.py", line 1375, in take
res = self.context.runJob(self, takeUpToNumLeft, p)
File "/home/mabarberan/anaconda3/lib/python3.6/site-packages/pyspark/context.py", line 1013, in runJob
sock_info = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, partitions)
File "/home/mabarberan/anaconda3/lib/python3.6/site-packages/py4j/java_gateway.py", line 1257, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/home/mabarberan/anaconda3/lib/python3.6/site-packages/py4j/protocol.py", line 328, in get_return_value
format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 0, localhost, executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/home/mabarberan/anaconda3/lib/python3.6/site-packages/pyspark/python/lib/pyspark.zip/pyspark/worker.py", line 176, in main
("%d.%d" % sys.version_info[:2], version))
Exception: Python in worker has different version 2.7 than that in driver 3.6, PySpark cannot run with different minor versions.Please check environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly set.
Process finished with exit code 1
I have googled the errors and have seen them occur individually to other people, but it seems I'm hitting all of them at the same time. I tried some fixes I found on the web, but they didn't work, so I'm stuck. I would appreciate any help with this.
Thanks in advance.
You are starting the driver with the Anaconda Python while the workers are using the default system Python. You can either remove the Anaconda path from .bashrc, or set the PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON variables to the Python interpreter you want to use. For example, add the lines below to $SPARK_HOME/conf/spark-env.sh:
export PYSPARK_PYTHON=/usr/bin/python3
export PYSPARK_DRIVER_PYTHON=/usr/bin/python3
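If you'd rather not edit spark-env.sh, a minimal sketch of an alternative (assuming local mode, where workers run on the same machine) is to point both variables at the interpreter running your script, before the SparkContext is created:

import os
import sys

# Make driver and workers use the same interpreter (the one running this script).
# Must be set before the SparkContext is created.
os.environ["PYSPARK_PYTHON"] = sys.executable
os.environ["PYSPARK_DRIVER_PYTHON"] = sys.executable

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="PysparkStreaming")
ssc = StreamingContext(sc, 3)

With this in place, the version-mismatch exception from worker.py should go away, since driver and workers resolve to the same Python.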