
Several errors while trying to run Spark with Python 3

I'm trying to run Spark through PySpark on Python 3 under Ubuntu 18.04, but I'm getting a number of different errors that I don't know how to handle. I'm using the Java 10 JDK and my JAVA_HOME variable is already set.

This is the code I'm trying to run in Python:

import sys
from pyspark import SparkContext
from pyspark.streaming import StreamingContext



sc = SparkContext(appName="PysparkStreaming")
ssc = StreamingContext(sc, 3)   # streaming batches execute every 3 seconds
lines = ssc.textFileStream('/home/mabarberan/Escritorio/prueba spark/')  # directory to monitor for new files
counts = lines.flatMap(lambda line: line.split(" ")) \
    .map(lambda x: (x, 1)) \
    .reduceByKey(lambda a, b: a + b)
counts.pprint()
ssc.start()
ssc.awaitTermination()

And these are the errors I'm getting:

/home/mabarberan/anaconda3/bin/python /home/mabarberan/Descargas/carpeta.py
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.hadoop.security.authentication.util.KerberosUtil (file:/home/mabarberan/anaconda3/lib/python3.6/site-packages/pyspark/jars/hadoop-auth-2.7.3.jar) to method sun.security.krb5.Config.getInstance()
WARNING: Please consider reporting this to the maintainers of org.apache.hadoop.security.authentication.util.KerberosUtil
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
2018-06-21 12:53:07 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
[Stage 1:>                                                          (0 + 1) / 1]2018-06-21 12:53:13 ERROR PythonRunner:91 - Python worker exited unexpectedly (crashed)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/home/mabarberan/anaconda3/lib/python3.6/site-packages/pyspark/python/lib/pyspark.zip/pyspark/worker.py", line 176, in main
    ("%d.%d" % sys.version_info[:2], version))
Exception: Python in worker has different version 2.7 than that in driver 3.6, PySpark cannot run with different minor versions.Please check environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly set.

I can't paste the full traceback here, but this is what comes next:

at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:109)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1135)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base/java.lang.Thread.run(Thread.java:844)
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/home/mabarberan/anaconda3/lib/python3.6/site-packages/pyspark/python/lib/pyspark.zip/pyspark/worker.py", line 176, in main
    ("%d.%d" % sys.version_info[:2], version))
Exception: Python in worker has different version 2.7 than that in driver 3.6, PySpark cannot run with different minor versions.Please check environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly set.

2018-06-21 12:53:13 ERROR TaskSetManager:70 - Task 0 in stage 1.0 failed 1 times; aborting job
2018-06-21 12:53:13 ERROR JobScheduler:91 - Error running job streaming job 1529578392000 ms.0
org.apache.spark.SparkException: An exception was raised by Python:
Traceback (most recent call last):
  File "/home/mabarberan/anaconda3/lib/python3.6/site-packages/pyspark/streaming/util.py", line 65, in call
    r = self.func(t, *rdds)
  File "/home/mabarberan/anaconda3/lib/python3.6/site-packages/pyspark/streaming/dstream.py", line 171, in takeAndPrint
    taken = rdd.take(num + 1)
  File "/home/mabarberan/anaconda3/lib/python3.6/site-packages/pyspark/rdd.py", line 1375, in take
    res = self.context.runJob(self, takeUpToNumLeft, p)
  File "/home/mabarberan/anaconda3/lib/python3.6/site-packages/pyspark/context.py", line 1013, in runJob
    sock_info = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, partitions)
  File "/home/mabarberan/anaconda3/lib/python3.6/site-packages/py4j/java_gateway.py", line 1257, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/home/mabarberan/anaconda3/lib/python3.6/site-packages/py4j/protocol.py", line 328, in get_return_value
    format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 0, localhost, executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/home/mabarberan/anaconda3/lib/python3.6/site-packages/pyspark/python/lib/pyspark.zip/pyspark/worker.py", line 176, in main
    ("%d.%d" % sys.version_info[:2], version))
Exception: Python in worker has different version 2.7 than that in driver 3.6, PySpark cannot run with different minor versions.Please check environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly set.

Process finished with exit code 1

I have googled the errors and seen each of them occurring individually to other people, but I seem to have all of them at the same time. I tried some fixes I found on the web, but they didn't work, so I'm stuck. I would appreciate any help with this.

Thanks in advance

You are starting the driver with the Anaconda Python while the workers are using the system's default Python. You can either remove the Anaconda path from .bashrc, or set the PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON variables in spark-env.sh to the Python path you want to use. For example, add the lines below to $SPARK_HOME/conf/spark-env.sh:

export PYSPARK_PYTHON=/usr/bin/python3
export PYSPARK_DRIVER_PYTHON=/usr/bin/python3
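Alternatively, if you can't (or don't want to) edit spark-env.sh, a minimal sketch is to pin both interpreters from inside the script itself, before the SparkContext is created. Using sys.executable is an assumption that the script is already launched with the Python you want the workers to match; you can substitute an explicit path like /usr/bin/python3.

```python
import os
import sys

# Pin the worker and driver interpreters to the same Python that is
# running this script. This must happen BEFORE the SparkContext is
# created, or the setting has no effect on the workers.
os.environ["PYSPARK_PYTHON"] = sys.executable
os.environ["PYSPARK_DRIVER_PYTHON"] = sys.executable
```

Either way, the goal is the same: the version PySpark reports for the worker must match the driver's, so the "different minor versions" exception goes away.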
