Running pySpark in Jupyter notebooks - Windows

I would like to run pySpark from a Jupyter notebook. I downloaded and installed Anaconda, which includes Jupyter. I wrote the following lines:

 from pyspark import SparkConf, SparkContext
 conf = SparkConf().setMaster("local").setAppName("My App")
 sc = SparkContext(conf = conf)

I get the following error:

ImportError                               Traceback (most recent call last)
<ipython-input-3-98c83f0bd5ff> in <module>()
  ----> 1 from pyspark import SparkConf, SparkContext
  2 conf = SparkConf().setMaster("local").setAppName("My App")
  3 sc = SparkContext(conf = conf)

 C:\software\spark\spark-1.6.2-bin-hadoop2.6\python\pyspark\__init__.py in   <module>()
 39 
 40 from pyspark.conf import SparkConf
  ---> 41 from pyspark.context import SparkContext
 42 from pyspark.rdd import RDD
 43 from pyspark.files import SparkFiles

 C:\software\spark\spark-1.6.2-bin-hadoop2.6\python\pyspark\context.py in <module>()
 26 from tempfile import NamedTemporaryFile
 27 
 ---> 28 from pyspark import accumulators
 29 from pyspark.accumulators import Accumulator
 30 from pyspark.broadcast import Broadcast

 ImportError: cannot import name accumulators

Based on an answer to the Stack Overflow question "importing pyspark in python shell", I tried adding a PYTHONPATH environment variable that points to the spark/python directory, but this did not help.
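When a PYTHONPATH change seems to have no effect, it can help to check what the notebook kernel actually sees. Environment variables are inherited when a process starts, so a Jupyter server launched before the variable was set will not pick it up. The snippet below is a minimal diagnostic sketch, assuming Python 3.4 or later; the helper name `describe_module` is made up for illustration.

```python
import importlib.util
import os


def describe_module(name):
    """Return the file Python would import `name` from, or None if it is not found."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


# What the running kernel inherited, which may differ from the system settings.
print("PYTHONPATH =", os.environ.get("PYTHONPATH"))
# None here means pyspark is not on the kernel's import path at all.
print("pyspark found at:", describe_module("pyspark"))
```

If `PYTHONPATH` prints as `None` inside the notebook, restarting Jupyter from a fresh command prompt is worth trying before changing anything else.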

This worked for me:

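import os
import sys

# Raw string, so the backslash is not read as an escape character.
spark_path = r"D:\spark"

# Point Spark and Hadoop at the installation directory.
os.environ['SPARK_HOME'] = spark_path
os.environ['HADOOP_HOME'] = spark_path

# Make the PySpark sources and the bundled py4j importable.
# The py4j file name must match the zip shipped with your Spark release.
sys.path.append(spark_path + "/bin")
sys.path.append(spark_path + "/python")
sys.path.append(spark_path + "/python/pyspark/")
sys.path.append(spark_path + "/python/lib")
sys.path.append(spark_path + "/python/lib/pyspark.zip")
sys.path.append(spark_path + "/python/lib/py4j-0.9-src.zip")

from pyspark import SparkContext
from pyspark import SparkConf

# Start a local Spark context named "test".
sc = SparkContext("local", "test")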

To verify:

In [2]: sc
Out[2]: <pyspark.context.SparkContext at 0x707ccf8>
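The py4j zip file name changes between Spark releases, so a hardcoded `py4j-0.9-src.zip` only matches some versions. The sketch below is one way to avoid that by globbing for the file instead. It assumes the usual `python/lib` layout of a Spark binary download, and the helper names `spark_python_paths` and `add_spark_to_path` are hypothetical, not part of PySpark.

```python
import glob
import os
import sys


def spark_python_paths(spark_home):
    """Return the sys.path entries PySpark needs under `spark_home`."""
    python_dir = os.path.join(spark_home, "python")
    # Match whichever py4j version ships with this Spark release.
    py4j_zips = sorted(glob.glob(os.path.join(python_dir, "lib", "py4j-*-src.zip")))
    return [python_dir] + py4j_zips


def add_spark_to_path(spark_home):
    """Set SPARK_HOME and append the PySpark entries to sys.path once."""
    os.environ["SPARK_HOME"] = spark_home
    for entry in spark_python_paths(spark_home):
        if entry not in sys.path:
            sys.path.append(entry)
```

Calling `add_spark_to_path(r"D:\spark")` before `from pyspark import SparkContext` would then stand in for the list of `sys.path.append` lines.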

2018 version

INSTALL PYSPARK ON WINDOWS 10 JUPYTER NOTEBOOK WITH ANACONDA NAVIGATOR

STEP 1

Download Packages

1) spark-2.2.0-bin-hadoop2.7.tgz

2) java jdk 8 version

3) Anaconda v 5.2

4) scala-2.12.6.msi

5) hadoop v2.7.1

STEP 2

MAKE A SPARK FOLDER ON THE C:/ DRIVE AND PUT EVERYTHING INSIDE IT

NOTE: DURING THE INSTALLATION OF SCALA, SET THE SCALA PATH TO A LOCATION INSIDE THE SPARK FOLDER

STEP 3

NOW SET NEW WINDOWS ENVIRONMENT VARIABLES

  1. HADOOP_HOME=C:\spark\hadoop

  2. JAVA_HOME=C:\Program Files\Java\jdk1.8.0_151

  3. SCALA_HOME=C:\spark\scala\bin

  4. SPARK_HOME=C:\spark\spark

  5. PYSPARK_PYTHON=C:\Users\user\Anaconda3\python.exe

  6. PYSPARK_DRIVER_PYTHON=C:\Users\user\Anaconda3\Scripts\jupyter.exe

  7. PYSPARK_DRIVER_PYTHON_OPTS=notebook

  8. NOW ADD SPARK TO THE PATH VARIABLE:

    Click on Edit and add New

    Add "C:\spark\spark\bin" to the Windows "Path" variable
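A typo in any of these variables tends to show up later as an unhelpful startup error, so checking them up front can save time. The following is a minimal sketch that reports variables from the list above that are unset or point to a path that does not exist. The function name `find_env_problems` is hypothetical, and `PYSPARK_DRIVER_PYTHON_OPTS` is left out because its value is an option, not a path.

```python
import os

# Variables from the list above whose values should be existing paths.
REQUIRED_VARS = [
    "HADOOP_HOME",
    "JAVA_HOME",
    "SCALA_HOME",
    "SPARK_HOME",
    "PYSPARK_PYTHON",
    "PYSPARK_DRIVER_PYTHON",
]


def find_env_problems(env, exists=os.path.exists):
    """Return a list of human-readable problems found in the mapping `env`."""
    problems = []
    for name in REQUIRED_VARS:
        value = env.get(name)
        if not value:
            problems.append(name + " is not set")
        elif not exists(value):
            problems.append(name + " points to a missing path: " + value)
    return problems


# Report problems for the current process; prints nothing if all is well.
for problem in find_env_problems(os.environ):
    print(problem)
```

Run it from a newly opened Anaconda prompt, since prompts that were already open will not see variables set afterwards.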

STEP 4

  • Make a folder where you want to store Jupyter Notebook outputs and files
  • After that, open the Anaconda command prompt and cd into that folder
  • Then enter pyspark

That's it: your browser will open Jupyter on localhost.

STEP 5

Check whether pyspark is working!

Type some simple code and run it:

from pyspark.sql import Row
a = Row(name = 'Vinay' , age=22 , height=165)
print("a: ",a)

Running pySpark in Jupyter notebooks - Windows

JAVA8 : https://www.guru99.com/install-java.html

Anaconda : https://www.anaconda.com/distribution/

Pyspark in jupyter : https://changhsinlee.com/install-pyspark-windows-jupyter/

import findspark

findspark.init()

from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *

spark = SparkSession.builder.appName('test').getOrCreate()
data = [(1, "siva", 100), (2, "siva2", 200),(3, "siva3", 300),(4, "siva4", 400),(5, "siva5", 500)]
schema = ['id', 'name', 'salary']

df = spark.createDataFrame(data, schema=schema)
df.show()
