
Do I need to run findspark every time, or just once?

My way of using pyspark is to always run the code below in Jupyter. Is this step always necessary?

import findspark
findspark.init('/opt/spark2.4')
import pyspark
sc = pyspark.SparkContext()

If you want to drop the findspark dependency, just make sure these variables are set in your .bashrc:

export SPARK_HOME='/opt/spark2.4'
export PYTHONPATH=$SPARK_HOME/python:$PYTHONPATH
export PYSPARK_DRIVER_PYTHON="jupyter"
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
export PYSPARK_PYTHON=python3
export PATH=$SPARK_HOME/bin:$PATH:~/.local/bin:$JAVA_HOME/bin:$JAVA_HOME/jre/bin
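
A quick sanity check (a minimal sketch, assuming the exports above are in effect in a fresh shell and Jupyter was launched from it):

import pyspark                                    # should import without findspark
sc = pyspark.SparkContext(appName="env-check")    # appName here is arbitrary
print(sc.version)                                 # e.g. "2.4.x"
sc.stop()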

Change the directories to match your environment, and the Spark version as well. Without these variables, findspark has to stay in your code so the Python interpreter can find the Spark directory.

Once this works, you can run pip uninstall findspark.

EDIT:

Pure Python solution: add this code at the top of your Jupyter notebook (for example, in the first cell):

import os
import sys

# Point PySpark at the interpreter and Spark install for this environment
os.environ["PYSPARK_PYTHON"] = "/opt/continuum/anaconda/bin/python"
os.environ["SPARK_HOME"] = "/opt/spark2.4"
os.environ["PYLIB"] = os.environ["SPARK_HOME"] + "/python/lib"
# The py4j version in the zip name must match the one bundled with your Spark release
sys.path.insert(0, os.environ["PYLIB"] + "/py4j-0.9-src.zip")
sys.path.insert(0, os.environ["PYLIB"] + "/pyspark.zip")
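
Note that the py4j zip name depends on the Spark release (Spark 2.4 likely ships a newer py4j than 0.9), so check what your install actually provides and use that name in the sys.path.insert call above (a small sketch, assuming the same /opt/spark2.4 path):

import glob, os
# Print the py4j source zip(s) bundled with this Spark install
print(glob.glob(os.path.join("/opt/spark2.4", "python", "lib", "py4j-*-src.zip")))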

Source: Anaconda docs

I believe you only need to call this once; what it does is edit your .bashrc file and set the environment variables there:

findspark.init('/path/to/spark_home', edit_rc=True)
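
If edit_rc works as described, it only appends the exports to your .bashrc, so the change takes effect in new shells: re-source ~/.bashrc (or open a new terminal) and restart Jupyter before removing the findspark import from your notebooks.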
