
How do I install numpy and pandas for Python 3.5 in Spark?

I am trying to run a linear regression in Spark using Python 3.5 instead of Python 2.7. So first I exported PYSPARK_PYTHON=python3. I then received the error "No module named numpy". I tried pip install numpy, but pip doesn't recognize the PYSPARK_PYTHON setting. How do I ask pip to install numpy for 3.5? Thank you ...

$ export PYSPARK_PYTHON=python3

$ spark-submit linreg.py
....
Traceback (most recent call last):
  File "/home/yoda/Code/idenlink-examples/test22-spark-linreg/linreg.py", line 115, in <module>
    from pyspark.ml.linalg import Vectors
  File "/home/yoda/install/spark/python/lib/pyspark.zip/pyspark/ml/__init__.py", line 22, in <module>
  File "/home/yoda/install/spark/python/lib/pyspark.zip/pyspark/ml/base.py", line 21, in <module>
  File "/home/yoda/install/spark/python/lib/pyspark.zip/pyspark/ml/param/__init__.py", line 26, in <module>
ImportError: No module named 'numpy'

$ pip install numpy
Requirement already satisfied: numpy in /home/yoda/.local/lib/python2.7/site-packages

$ pyspark
Python 3.5.2 (default, Nov 17 2016, 17:05:23) 
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
17/02/09 20:29:20 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/02/09 20:29:20 WARN Utils: Your hostname, yoda-VirtualBox resolves to a loopback address: 127.0.1.1; using 10.0.2.15 instead (on interface enp0s3)
17/02/09 20:29:20 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
17/02/09 20:29:31 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.1.0
      /_/

Using Python version 3.5.2 (default, Nov 17 2016 17:05:23)
SparkSession available as 'spark'.
>>> import site; site.getsitepackages()
['/usr/local/lib/python3.5/dist-packages', '/usr/lib/python3/dist-packages', '/usr/lib/python3.5/dist-packages']
>>> 
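Note what the pip run above shows: that pip belongs to Python 2.7 and installs into its site-packages; pip never reads PYSPARK_PYTHON. A quick way to check which interpreter a given pip targets, and to drive the Python 3 one explicitly (assuming the python3 on your PATH ships with pip):

$ pip --version                 # reports which Python this pip installs into
$ python3 -m pip --version      # the pip bound to the python3 interpreter
$ python3 -m pip install numpy  # installs numpy for Python 3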

So I don't actually see this as a Spark question at all. It looks to me like you need help with environments. As the commenter mentioned, you need to set up a Python 3 environment, activate it, and then install numpy. Take a look at this for a little help on working with environments. After setting up the Python 3 environment, activate it, run pip install numpy or conda install numpy, and you should be good to go.
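A minimal sketch of that workflow using the built-in venv module (the environment path ~/py35 is illustrative, not from the answer):

$ python3 -m venv ~/py35                      # create an isolated Python 3 environment (path is hypothetical)
$ source ~/py35/bin/activate                  # activate it, so pip and python now mean the Python 3 ones
$ pip install numpy pandas                    # installs into the Python 3 environment
$ export PYSPARK_PYTHON=~/py35/bin/python3    # point Spark workers at this interpreter
$ spark-submit linreg.py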

If you are running the job in local mode, you just need to upgrade pyspark.

Homebrew: brew upgrade pyspark; this should solve most of the dependencies.
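A quick way to confirm the fix afterwards is to check that the interpreter PySpark will use can import numpy (a hedged check; python3 here stands in for whatever PYSPARK_PYTHON points to):

$ python3 -c "import numpy; print(numpy.__version__)"   # prints a version instead of ImportError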
