
How do I install numpy and pandas for Python 3.5 in Spark?

I am trying to run a linear regression in Spark using Python 3.5 instead of Python 2.7. So first I exported PYSPARK_PYTHON=python3. I received an error "No module named numpy". I tried "pip install numpy", but pip doesn't recognize the PYSPARK_PYTHON setting. How do I ask pip to install numpy for 3.5? Thank you ...

$ export PYSPARK_PYTHON=python3

$ spark-submit linreg.py
....
Traceback (most recent call last):
  File "/home/yoda/Code/idenlink-examples/test22-spark-linreg/linreg.py", line 115, in <module>
from pyspark.ml.linalg import Vectors
  File "/home/yoda/install/spark/python/lib/pyspark.zip/pyspark/ml/__init__.py", line 22, in <module>
  File "/home/yoda/install/spark/python/lib/pyspark.zip/pyspark/ml/base.py", line 21, in <module>
  File "/home/yoda/install/spark/python/lib/pyspark.zip/pyspark/ml/param/__init__.py", line 26, in <module>
  ImportError: No module named 'numpy'

$ pip install numpy
Requirement already satisfied: numpy in /home/yoda/.local/lib/python2.7/site-packages

$ pyspark
Python 3.5.2 (default, Nov 17 2016, 17:05:23) 
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
17/02/09 20:29:20 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/02/09 20:29:20 WARN Utils: Your hostname, yoda-VirtualBox resolves to a loopback address: 127.0.1.1; using 10.0.2.15 instead (on interface enp0s3)
17/02/09 20:29:20 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
17/02/09 20:29:31 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.1.0
      /_/

Using Python version 3.5.2 (default, Nov 17 2016 17:05:23)
SparkSession available as 'spark'.
>>> import site; site.getsitepackages()
['/usr/local/lib/python3.5/dist-packages', '/usr/lib/python3/dist-packages', '/usr/lib/python3.5/dist-packages']
>>> 

So I don't actually see this as a Spark question at all. It looks to me like you need help with environments. As the commenter mentioned, you need to set up a Python 3 environment, activate it, and then install numpy. Take a look at this for a little help on working with environments. After setting up the Python 3 environment, activate it and run pip install numpy or conda install numpy, and you should be good to go.
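
A minimal sketch of that workflow, assuming Python 3.5 is available as python3 and using the built-in venv module (a conda environment works the same way; the environment path below is just an example):

$ python3 -m venv ~/venvs/spark35            # create a Python 3 environment
$ source ~/venvs/spark35/bin/activate        # activate it
$ pip install numpy                          # this pip is the environment's own Python 3 pip, so numpy lands in the right site-packages
$ export PYSPARK_PYTHON=~/venvs/spark35/bin/python
$ spark-submit linreg.py                     # PySpark now uses the environment's interpreter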

If you are running the job locally, you just need to upgrade pyspark.

Homebrew: brew upgrade pyspark. This should resolve most of the dependencies.
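
If Homebrew is not an option, a pip-based alternative (assuming the pyspark package for your Spark version is published on PyPI) is to upgrade it inside the same Python 3 environment:

$ pip3 install --upgrade pyspark numpy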
