Running pySpark in Jupyter notebooks - Windows
I would like to run pySpark from a Jupyter notebook. I downloaded and installed Anaconda, which includes Jupyter. I created the following lines:
from pyspark import SparkConf, SparkContext
conf = SparkConf().setMaster("local").setAppName("My App")
sc = SparkContext(conf = conf)
I get the following error:
ImportError Traceback (most recent call last)
<ipython-input-3-98c83f0bd5ff> in <module>()
----> 1 from pyspark import SparkConf, SparkContext
2 conf = SparkConf().setMaster("local").setAppName("My App")
3 sc = SparkContext(conf = conf)
C:\software\spark\spark-1.6.2-bin-hadoop2.6\python\pyspark\__init__.py in <module>()
39
40 from pyspark.conf import SparkConf
---> 41 from pyspark.context import SparkContext
42 from pyspark.rdd import RDD
43 from pyspark.files import SparkFiles
C:\software\spark\spark-1.6.2-bin-hadoop2.6\python\pyspark\context.py in <module>()
26 from tempfile import NamedTemporaryFile
27
---> 28 from pyspark import accumulators
29 from pyspark.accumulators import Accumulator
30 from pyspark.broadcast import Broadcast
ImportError: cannot import name accumulators
Based on an answer to the Stack Overflow question importing pyspark in python shell, I tried adding a PYTHONPATH environment variable pointing to the spark/python directory, but this was of no help.
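For reference, the in-session equivalent of that attempt looks roughly like this. Note that putting spark/python on the path is not enough by itself: the ImportError above usually means Spark's bundled py4j library is also missing from the path, which the answer below addresses. (A sketch; the install location is copied from the traceback.)

import sys

# Hypothetical in-session equivalent of setting PYTHONPATH; the install
# location is taken from the traceback above.
spark_python = "C:/software/spark/spark-1.6.2-bin-hadoop2.6/python"
sys.path.append(spark_python)
# Spark's bundled py4j must be on the path too, or the import still fails.
sys.path.append(spark_python + "/lib/py4j-0.9-src.zip")

from pyspark import SparkConf, SparkContext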
This worked for me:
import os
import sys

# Point Spark-related environment variables at the local install.
spark_path = "D:/spark"
os.environ['SPARK_HOME'] = spark_path
os.environ['HADOOP_HOME'] = spark_path

# Make the pyspark package and its bundled py4j visible to Python.
sys.path.append(spark_path + "/bin")
sys.path.append(spark_path + "/python")
sys.path.append(spark_path + "/python/pyspark/")
sys.path.append(spark_path + "/python/lib")
sys.path.append(spark_path + "/python/lib/pyspark.zip")
sys.path.append(spark_path + "/python/lib/py4j-0.9-src.zip")

from pyspark import SparkContext
from pyspark import SparkConf

sc = SparkContext("local", "test")
To verify:
In [2]: sc
Out[2]: <pyspark.context.SparkContext at 0x707ccf8>
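As an extra sanity check, a trivial job can be run on the new context (a minimal sketch; the result is just sum(range(100))):

In [3]: sc.parallelize(range(100)).sum()
Out[3]: 4950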
2018 version
INSTALL PYSPARK on Windows 10 JUPYTER-NOTEBOOK with ANACONDA NAVIGATOR
Download packages:
1) spark-2.2.0-bin-hadoop2.7.tgz
2) Java JDK 8
3) Anaconda v5.2
4) scala-2.12.6.msi
5) hadoop v2.7.1
Make a spark folder in the C:\ drive and put everything inside it.
NOTE: during the installation of Scala, give the path of Scala inside the spark folder.
Now set new Windows environment variables:
HADOOP_HOME=C:\spark\hadoop
JAVA_HOME=C:\Program Files\Java\jdk1.8.0_151
SCALA_HOME=C:\spark\scala\bin
SPARK_HOME=C:\spark\spark\bin
PYSPARK_PYTHON=C:\Users\user\Anaconda3\python.exe
PYSPARK_DRIVER_PYTHON=C:\Users\user\Anaconda3\Scripts\jupyter.exe
PYSPARK_DRIVER_PYTHON_OPTS=notebook
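Before launching, a quick way to confirm these variables are visible to Python from a fresh shell (a minimal check, not part of the original steps):

import os

# Print each variable set above, or a marker if it is missing.
for var in ("SPARK_HOME", "HADOOP_HOME", "JAVA_HOME", "SCALA_HOME",
            "PYSPARK_PYTHON", "PYSPARK_DRIVER_PYTHON", "PYSPARK_DRIVER_PYTHON_OPTS"):
    print(var, "=", os.environ.get(var, "<not set>"))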
Now select the path of Spark:
Click on Edit and add New.
Add "C:\spark\spark\bin" to the "Path" variable.
That's it: run pyspark and your browser will pop up with Jupyter on localhost.
Check whether pyspark is working: type some simple code and run it.
from pyspark.sql import Row
a = Row(name='Vinay', age=22, height=165)
print("a: ",a)
Running pySpark in Jupyter notebooks - Windows
Java 8: https://www.guru99.com/install-java.html
Anaconda: https://www.anaconda.com/distribution/
PySpark in Jupyter: https://changhsinlee.com/install-pyspark-windows-jupyter/
import findspark
findspark.init()  # locate the Spark installation and add it to sys.path

from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *

# Start (or reuse) a local Spark session.
spark = SparkSession.builder.appName('test').getOrCreate()

# Build a small DataFrame to confirm everything works.
data = [(1, "siva", 100), (2, "siva2", 200), (3, "siva3", 300), (4, "siva4", 400), (5, "siva5", 500)]
schema = ['id', 'name', 'salary']
df = spark.createDataFrame(data, schema=schema)
df.show()
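From here, ordinary DataFrame operations work the same way; for example, a quick filter on the salary column (a minimal sketch building on the DataFrame above):

# Keep only rows with salary above 250 and show name and salary.
df.filter(df.salary > 250).select('name', 'salary').show()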