
Running a PySpark program on a Python 3 kernel in Jupyter Notebook

I installed PySpark with pip install pyspark. I didn't set any paths or environment variables; however, I found that everything was downloaded and copied into C:/Users/Admin/anaconda3/scripts. I opened Jupyter Notebook with a Python 3 kernel and tried to run a SystemML script, but it gave me an error. I realized that I also needed to place winutils.exe in C:/Users/Admin/anaconda3/scripts, so I did that and the script ran as expected.

Now, my program includes GridSearch, and when I run it on my personal laptop it is markedly slower than on a cloud data platform where I can start a kernel with Spark (such as IBM Watson Studio).

So my questions are:

(i) How do I add PySpark to the Python 3 kernel? Or is it already working in the background when I import pyspark?

(ii) When I run the same code on the same dataset using pandas and scikit-learn, there is not much difference in performance. When is PySpark preferred/beneficial over pandas and scikit-learn?

Another thing is, even though PySpark seems to be working fine and I'm able to import its libraries, when I try to run

import findspark
findspark.init()

it throws an error (on line 2) saying the list index is out of range. I googled a bit and found advice saying that I had to explicitly set SPARK_HOME='C:/Users/Admin/anaconda3/Scripts'; but when I do that, pyspark stops working (and findspark.init() still does not work).

If anyone can explain what is going on, I'd be very grateful. Thank you.

How do I add PySpark to the Python3 kernel

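Install it with pip install pyspark, as you said you have already done. Nothing runs in the background on import: import pyspark only loads the Python package, and Spark itself starts when you create a SparkSession or SparkContext.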
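As a rough illustration of what "adding PySpark to the kernel" amounts to, here is a minimal sketch. It assumes pyspark was installed with pip into the same environment the notebook kernel uses and that a Java runtime is available; the function names and the application name are made up for this example.

```python
import importlib.util


def pyspark_available() -> bool:
    """Return True if this kernel's environment can import pyspark."""
    return importlib.util.find_spec("pyspark") is not None


def start_local_session(app_name: str = "notebook-demo"):
    """Start (or reuse) a SparkSession that runs only on this machine.

    The app name is a hypothetical label; it just shows up in the Spark UI.
    """
    # Imported lazily so this cell still loads where pyspark is missing.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .master("local[*]")  # local mode: a single JVM using all local cores
        .appName(app_name)
        .getOrCreate()
    )
```

In a notebook cell you would check pyspark_available() first, then call spark = start_local_session() and finish with spark.stop(). No separate "Spark kernel" is needed for this to work.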

there is not much difference in performance

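You're only using one machine, so there wouldn't be. In local mode, Spark shares the same CPU cores and memory that pandas and scikit-learn use, and it adds JVM and scheduling overhead on top.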

When is PySpark preferred/beneficial over pandas and scikit-learn?

When you want to deploy the same code onto an actual Spark cluster and your dataset is stored in distributed storage.
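One way to picture "the same code on a cluster" is that only the master URL changes between the laptop and the cluster. The sketch below assumes a made-up environment variable, SPARK_MASTER_URL, that a cluster deployment would set; neither that variable nor the function names come from Spark itself.

```python
import os


def choose_master(environ=None, default="local[*]"):
    """Pick the Spark master URL, falling back to local mode on a laptop.

    SPARK_MASTER_URL is a hypothetical variable name used for this sketch.
    """
    env = os.environ if environ is None else environ
    return env.get("SPARK_MASTER_URL", default)


def build_session(master, app_name="grid-search-demo"):
    """Create a SparkSession against the given master (needs pyspark + Java)."""
    # Imported lazily so choose_master() works even without pyspark installed.
    from pyspark.sql import SparkSession

    return SparkSession.builder.master(master).appName(app_name).getOrCreate()
```

On a laptop, build_session(choose_master()) runs everything in one local JVM. On a cluster, the same call would pick up whatever master URL the deployment provides, and the work is spread across the executors.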


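You don't necessarily need findspark if your environment variables are already set up. findspark exists to make a separately downloaded Spark distribution importable; a pip-installed pyspark is already on the Python path, so the findspark.init() call can be dropped.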
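To see whether findspark is needed at all, you can inspect the environment from inside the notebook. This is only a diagnostic sketch using the standard library; the helper names are invented for this example.

```python
import importlib.util
import os


def describe_spark_setup(environ=None):
    """Collect the facts that decide whether findspark is needed."""
    env = os.environ if environ is None else environ
    spark_home = env.get("SPARK_HOME")
    return {
        # True when `import pyspark` would succeed in this kernel
        "pyspark_importable": importlib.util.find_spec("pyspark") is not None,
        # Value of SPARK_HOME, or None when it is not set
        "spark_home": spark_home,
        # Whether SPARK_HOME points at an existing directory
        "spark_home_is_dir": bool(spark_home) and os.path.isdir(spark_home),
    }


def needs_findspark(setup):
    """findspark only helps when pyspark cannot be imported directly."""
    return not setup["pyspark_importable"]
```

If needs_findspark(describe_spark_setup()) returns False, the two findspark lines can simply be removed from the notebook.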
