
Importing Spark in a Jupyter Notebook

I was following this guide and found that it used a variable named sc, which I figured is the Spark library.
I tried to install Spark using this guide, though I'm not sure if it installed correctly. Now when I try to import PySpark in the notebook, it's not recognized.
I'm on Windows; what should I do?

"I'm not sure if [Spark is] installed correctly ... on windows"

What happens if you run pyspark.cmd? It should display something like this, and then wait for you to enter commands:

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.1.1
      /_/

Using Python version 3.6.0 (default, Dec 23 2016 12:22:00)
SparkSession available as 'spark'.
>>>

At this point you can type spark and/or sc to check whether you have an implicit SparkSession (V2 only) and/or SparkContext (deprecated in V2 but still there for compatibility) object(s).
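As a quick smoke test at that >>> prompt (the version string will of course match your own install, and these objects only exist if startup succeeded), something like this should work:

>>> spark.version                      # the SparkSession created by shell.py
'2.1.1'
>>> sc.parallelize(range(5)).sum()     # trivial job to prove the SparkContext works
10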

But if you see a swarm of error messages and no sc object has been created, then you must enable verbose logging by adding this property at the end of your log4j.properties file...

log4j.logger.org.apache.spark.repl.Main=DEBUG

...and enjoy your next 5 days of debugging.
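For context, here is roughly where that line would sit. This is only a sketch, assuming your file was copied from the stock conf/log4j.properties.template shipped with Spark, so the appender lines in your copy may differ slightly:

# Set everything to be logged to the console (from the stock template)
log4j.rootCategory=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

# Added at the end: verbose logging for the REPL startup
log4j.logger.org.apache.spark.repl.Main=DEBUG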
Also, read carefully https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-tips-and-tricks-running-spark-windows.html


when I try to import PySpark in the notebook, it's not recognized

The proper way to "define" PySpark in Jupyter is to create a kernel configuration file, in one of the default locations (cf. http://jupyter-client.readthedocs.io/en/latest/kernels.html )

To define a new kernel, create a sub-directory there (the name is not important) containing a file named kernel.json (that exact name) that looks like this...

{ "display_name": "PySpark 2.1.1",
  "language": "python",
  "argv": [
     "/usr/local/bin/python3",
     "-m", "ipykernel",
     "-f", "{connection_file}" ],
  "env": {
    "SPARK_HOME": "/opt/spark/spark-2.1.1-bin-hadoop2.7",
    "PYTHONPATH": "/opt/spark/spark-2.1.1-bin-hadoop2.7/python/:/opt/spark/spark-2.1.1-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip",
    "PYTHONSTARTUP": "/opt/spark/spark-2.1.1-bin-hadoop2.7/python/pyspark/shell.py",
    "PYSPARK_SUBMIT_ARGS": "pyspark-shell" }
}

That's an example taken from a Linux box; now it's your job to adapt the paths to your actual Windows install, and adapt the Py4J version if required.
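As a rough Windows sketch (all paths below are assumptions; adjust them to wherever you actually unpacked Spark and installed Python, and note that PYTHONPATH uses ; as the separator on Windows; HADOOP_HOME is only needed if you followed the winutils.exe tip from the link above):

{ "display_name": "PySpark 2.1.1 (Windows)",
  "language": "python",
  "argv": [
     "C:/Python36/python.exe",
     "-m", "ipykernel",
     "-f", "{connection_file}" ],
  "env": {
    "SPARK_HOME": "C:/spark/spark-2.1.1-bin-hadoop2.7",
    "HADOOP_HOME": "C:/hadoop",
    "PYTHONPATH": "C:/spark/spark-2.1.1-bin-hadoop2.7/python;C:/spark/spark-2.1.1-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip",
    "PYTHONSTARTUP": "C:/spark/spark-2.1.1-bin-hadoop2.7/python/pyspark/shell.py",
    "PYSPARK_SUBMIT_ARGS": "pyspark-shell" }
}

On Windows the per-user kernel directory is typically %APPDATA%\jupyter\kernels\<name>\kernel.json; check the kernel-spec link above for the exact search paths on your machine.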

Note that you can stuff additional parameters into PYSPARK_SUBMIT_ARGS to override spark-defaults.conf, e.g.

"PYSPARK_SUBMIT_ARGS": "--master local[4] --conf spark.driver.memory=4G --conf spark.python.worker.memory=512M --conf spark.rdd.compress=true --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.local.dir=C:/TEMP/spark --conf spark.driver.extraClassPath=C:/path/to/myjdbcdriver.jar pyspark-shell"
