
Configure IPython/Jupyter notebook with PySpark on AWS EMR v4.0.0

I am trying to use the IPython notebook with Apache Spark 1.4.0. I have followed the two tutorials below to set up my configuration:

Installing Ipython notebook with pyspark 1.4 on AWS

and

Configuring IPython notebook support for Pyspark

After finishing the configuration, here is the relevant code from the related files:

1. ipython_notebook_config.py

c=get_config()
c.NotebookApp.ip = '*'
c.NotebookApp.open_browser = False
c.NotebookApp.port = 8193

2. 00-pyspark-setup.py

import os
import sys
spark_home = os.environ.get('SPARK_HOME', None)
sys.path.insert(0, spark_home + "/python")

# Add the py4j to the path.
# You may need to change the version number to match your install

sys.path.insert(0, os.path.join(spark_home, 'python/lib/py4j-0.8.2.1-src.zip'))
# Initialize PySpark to predefine the SparkContext variable 'sc'
execfile(os.path.join(spark_home, 'python/pyspark/shell.py'))
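(As the comment above notes, the py4j version number changes between Spark releases, so the hard-coded zip name is fragile. A version-agnostic sketch, assuming the standard Spark layout `$SPARK_HOME/python/lib/py4j-*-src.zip`:)

```python
import glob
import os
import sys

def add_pyspark_to_path(spark_home):
    """Put Spark's Python sources and the bundled py4j zip on sys.path."""
    sys.path.insert(0, os.path.join(spark_home, "python"))
    # Match any py4j version instead of hard-coding py4j-0.8.2.1.
    matches = glob.glob(os.path.join(spark_home, "python", "lib", "py4j-*-src.zip"))
    if not matches:
        raise RuntimeError("no py4j zip found under %s/python/lib" % spark_home)
    sys.path.insert(0, matches[0])
```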

I also added the following two lines to my .bash_profile:

export SPARK_HOME='/home/hadoop/spark'
source ~/.bash_profile

However, when I run

ipython notebook --profile=pyspark

it shows the message: unrecognized alias '--profile=pyspark', it will probably have no effect.

It seems that the notebook isn't configured with PySpark successfully. Does anyone know how to solve it? Thank you very much.

Following are the software versions:

ipython/Jupyter: 4.0.0

Spark: 1.4.0

AWS EMR: 4.0.0

Python: 2.7.9

By the way, I have read the following, but it doesn't work: IPython notebook won't read the configuration file

Jupyter notebooks don't have the concept of profiles (as IPython did). The recommended way of launching with a different configuration is, e.g.:

JUPYTER_CONFIG_DIR=~/alternative_jupyter_config_dir jupyter notebook

See also issue jupyter/notebook#309, where you'll find a comment describing how to set up the Jupyter notebook with PySpark without profiles or kernels.

This worked for me...

Update ~/.bashrc with:

export SPARK_HOME="<your location of spark>"
export PYSPARK_SUBMIT_ARGS="--master local[2] pyspark-shell"

(Look up the pyspark docs for those arguments.)

Then create a new IPython profile, e.g. pyspark:

ipython profile create pyspark

Then create the file ~/.ipython/profile_pyspark/startup/00-pyspark-setup.py with the following lines:

import os
import sys

spark_home = os.environ.get('SPARK_HOME', None)
sys.path.insert(0, spark_home + "/python")
sys.path.insert(0, os.path.join(spark_home, 'python/lib/py4j-0.9-src.zip'))

filename = os.path.join(spark_home, 'python/pyspark/shell.py')
exec(compile(open(filename, "rb").read(), filename, 'exec'))
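(Note that execfile, used in the question's setup script, was removed in Python 3; the exec(compile(...)) pattern above is the portable replacement. A minimal self-contained illustration, using a hypothetical temporary script in place of pyspark/shell.py:)

```python
import os
import tempfile

# Write a tiny stand-in script, then run it the same way the startup file
# runs pyspark/shell.py: compile its source and exec it into a namespace.
fd, path = tempfile.mkstemp(suffix=".py")
with os.fdopen(fd, "w") as f:
    f.write("greeting = 'hello from the exec-ed script'\n")

namespace = {}
exec(compile(open(path, "rb").read(), path, "exec"), namespace)
print(namespace["greeting"])  # variables defined by the script are visible here
os.remove(path)
```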

spark_release_file = spark_home + "/RELEASE"

if os.path.exists(spark_release_file) and "Spark 1.6" in open(spark_release_file).read():
    pyspark_submit_args = os.environ.get("PYSPARK_SUBMIT_ARGS", "")
    if "pyspark-shell" not in pyspark_submit_args:
        pyspark_submit_args += " pyspark-shell"
        os.environ["PYSPARK_SUBMIT_ARGS"] = pyspark_submit_args

(Update the py4j and Spark versions to suit your case.)

Then mkdir -p ~/.ipython/kernels/pyspark and create the file ~/.ipython/kernels/pyspark/kernel.json with the following lines:

{
 "display_name": "pySpark (Spark 1.6.1)",
 "language": "python",
 "argv": [
  "/usr/bin/python",
  "-m",
  "IPython.kernel",
  "--profile=pyspark",
  "-f",
  "{connection_file}"
 ]
}
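(A malformed kernel.json fails silently: the kernel simply doesn't appear in Jupyter's menu. A small sketch for sanity-checking the file before launching, assuming the path used above; the required-field list reflects the spec shown here, not an exhaustive check:)

```python
import json

def validate_kernel_spec(path):
    """Parse a kernel.json and check the fields Jupyter needs to launch it."""
    with open(path) as f:
        spec = json.load(f)  # raises ValueError on malformed JSON
    for key in ("display_name", "language", "argv"):
        if key not in spec:
            raise KeyError("kernel.json missing required field: %s" % key)
    # Jupyter substitutes the placeholder with the real connection file.
    if "{connection_file}" not in spec["argv"]:
        raise ValueError("argv must contain the {connection_file} placeholder")
    return spec
```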

Now you should see this kernel, pySpark (Spark 1.6.1), under Jupyter's new-notebook option. You can test by executing sc; you should see your Spark context.

I have tried so many ways to solve this 4.0 version problem, and finally I decided to install IPython version 3.2.3:

conda install 'ipython<4'

It's amazing! Hope this helps you all!

ref: https://groups.google.com/a/continuum.io/forum/#!topic/anaconda/ace9F4dWZTA

As people commented, in Jupyter you don't need profiles. All you need to do is export the variables for Jupyter to find your Spark install (I use zsh, but it's the same for bash):

emacs ~/.zshrc
export PATH="/Users/hcorona/anaconda/bin:$PATH"
export SPARK_HOME="$HOME/spark"
export PATH=$SPARK_HOME/bin:$PATH
export PYSPARK_SUBMIT_ARGS="--master local[*,8] pyspark-shell"
export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.9-src.zip:$PYTHONPATH

It is important to add pyspark-shell in PYSPARK_SUBMIT_ARGS. I found this guide useful but not fully accurate.

My config is local, but it should work if you change PYSPARK_SUBMIT_ARGS to the arguments you need.

I am having the same problem specifying the --profile argument. It seems to be a general problem with the new version, not related to Spark. If you downgrade to IPython 3.2.1 you will be able to specify the profile again.
