
Importing Spark in a Jupyter Notebook

I was following this guide and found that it used a variable named sc, which I figured is the Spark library.
I tried to install Spark using this guide, though I'm not sure it installed correctly. Now when I try to import PySpark in the notebook, it's not recognized.
I'm on Windows. What should I do?

"I'm not sure if [Spark is] installed correctly ... on windows"

What happens if you run pyspark.cmd? It should display something like this, then wait for you to enter commands:

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.1.1

Using Python version 3.6.0 (default, Dec 23 2016 12:22:00)
SparkSession available as 'spark'.

At this point you can type spark and/or sc to check whether you have an implicit SparkSession (V2 only) and/or SparkContext object. In V2 the SparkSession replaces SparkContext as the main entry point, but the SparkContext is still there, both as sc and as spark.sparkContext.
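If you would rather check both names in one go, here is a minimal sketch. report_spark_objects is a hypothetical helper, not part of PySpark; it only looks up the names spark and sc in whatever namespace you pass in, so it works the same in the pyspark shell and later in a notebook cell.

```python
def report_spark_objects(namespace):
    """Map 'spark' and 'sc' to the type name of the object bound to them.

    A name that is missing (or bound to None) is reported as None.
    """
    report = {}
    for name in ("spark", "sc"):
        obj = namespace.get(name)
        report[name] = type(obj).__name__ if obj is not None else None
    return report


# In the pyspark shell or a notebook cell, pass the current globals.
print(report_spark_objects(globals()))
```

If both entries come back as None, the shell startup did not create the objects, and the error messages printed before the prompt are the place to look.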

But if you see a swarm of error messages and no sc object has been created, then you must enable verbose logging by adding this property at the end of your log4j.properties file...


...and enjoy your next 5 days of debugging.
Also, read this carefully: https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-tips-and-tricks-running-spark-windows.html

when I try to import PySpark in the notebook, it's not recognized

The proper way to "define" PySpark in Jupyter is to create a kernel configuration file, in one of the default locations (cf. http://jupyter-client.readthedocs.io/en/latest/kernels.html )

To define a new kernel, create a sub-directory there (its name is not important) containing a file named kernel.json (that name exactly) that looks like this:

{ "display_name": "PySpark 2.1.1",
  "language": "python",
  "argv": [
     "python",
     "-m", "ipykernel",
     "-f", "{connection_file}" ],
  "env": {
    "SPARK_HOME": "/opt/spark/spark-2.1.1-bin-hadoop2.7",
    "PYTHONPATH": "/opt/spark/spark-2.1.1-bin-hadoop2.7/python/:/opt/spark/spark-2.1.1-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip",
    "PYTHONSTARTUP": "/opt/spark/spark-2.1.1-bin-hadoop2.7/python/pyspark/shell.py",
    "PYSPARK_SUBMIT_ARGS": "pyspark-shell" }
}

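That's an example taken from a Linux box; now it's your job to adapt the paths to your actual Windows install, and to adapt the Py4J version if required. Two things to watch for: argv must start with the Python executable you want the kernel to run, and on Windows the entries in PYTHONPATH are separated by ; rather than : .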
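To avoid hand-editing those paths, the file can be generated. The following is only a sketch under a few assumptions: build_kernel_spec and write_kernel_spec are hypothetical helper names, the interpreter is assumed to be reachable as python, and the Py4J zip is looked up with a glob instead of hard-coding its version. os.pathsep gives ; on Windows and : on Linux.

```python
import glob
import json
import os


def build_kernel_spec(spark_home, python_exe="python", display_name="PySpark"):
    """Build the kernel.json content for the Spark install at spark_home."""
    python_dir = os.path.join(spark_home, "python")
    # The Py4J version differs between Spark releases, so search for the zip.
    py4j_zips = sorted(glob.glob(os.path.join(python_dir, "lib", "py4j-*-src.zip")))
    return {
        "display_name": display_name,
        "language": "python",
        "argv": [python_exe, "-m", "ipykernel", "-f", "{connection_file}"],
        "env": {
            "SPARK_HOME": spark_home,
            # os.pathsep is ';' on Windows and ':' on Linux/macOS.
            "PYTHONPATH": os.pathsep.join([python_dir] + py4j_zips),
            "PYTHONSTARTUP": os.path.join(python_dir, "pyspark", "shell.py"),
            "PYSPARK_SUBMIT_ARGS": "pyspark-shell",
        },
    }


def write_kernel_spec(spec, kernel_dir):
    """Write spec as kernel.json inside kernel_dir and return the file path."""
    os.makedirs(kernel_dir, exist_ok=True)
    path = os.path.join(kernel_dir, "kernel.json")
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(spec, handle, indent=2)
    return path
```

You would call build_kernel_spec with your own Spark directory, then write_kernel_spec with a sub-directory of one of the kernel locations listed in the Jupyter documentation linked above. json.dump takes care of escaping the backslashes in Windows paths.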

Note that you can put additional parameters in PYSPARK_SUBMIT_ARGS to override spark-defaults.conf, e.g.

"PYSPARK_SUBMIT_ARGS": "--master local[4] --conf spark.driver.memory=4G --conf spark.python.worker.memory=512M --conf spark.rdd.compress=true --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.local.dir=C:/TEMP/spark --conf spark.driver.extraClassPath=C:/path/to/myjdbcdriver.jar pyspark-shell"
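A long value like that is easy to get wrong by hand. As a rough sketch, it can be assembled from a dictionary; build_submit_args is a hypothetical helper, and it assumes that no value contains spaces (such values would need quoting).

```python
def build_submit_args(master=None, conf=None):
    """Assemble a PYSPARK_SUBMIT_ARGS string ending in 'pyspark-shell'."""
    parts = []
    if master:
        parts += ["--master", master]
    # Each configuration property becomes one '--conf key=value' pair.
    for key, value in (conf or {}).items():
        parts += ["--conf", f"{key}={value}"]
    # The value has to end with 'pyspark-shell', as in the examples above.
    parts.append("pyspark-shell")
    return " ".join(parts)


print(build_submit_args("local[4]", {"spark.driver.memory": "4G"}))
```

The resulting string goes into the "env" section of kernel.json, in place of the plain "pyspark-shell" value.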

