
Jupyter notebook running through the VS Code Jupyter server gets "ModuleNotFoundError: No module named" with pyspark on Amazon EMR

I am running my Jupyter notebook on a remote server, which has pyspark and jupyter installed at:

  • usr/bin/pyspark
  • usr/local/bin/jupyter

I started the Jupyter server by calling pyspark:

  • export PYSPARK_DRIVER_PYTHON=jupyter
  • export PYSPARK_DRIVER_PYTHON_OPTS='notebook --no-browser'
  • nohup pyspark &

My Jupyter notebook runs on Amazon EMR and imports my Python module from the same folder:

from my_module import *
spark.sparkContext.addPyFile('my_module.py')

When I run the notebook in my local browser (through an SSH tunnel to the server), it works perfectly. If I convert the notebook to the equivalent Python file and run it with spark-submit, it also works.

But when I run the notebook in VS Code (with the Microsoft Python extension installed), connected remotely to the server and using the same Jupyter server I created, it gives me "ModuleNotFoundError: No module named", complaining about "from my_module import *". I then compared the working directories:

In the browser, the notebook's working directory is the folder that contains the notebook. In VS Code, the Jupyter server's working directory is my ~ directory on the remote server. Even after I used "os.chdir" and "sys.path.append" to point at my notebook directory, it still raised the "No module named" error. So I changed my import to

from projects.project_name.my_module import *

That worked. It looks like the Jupyter server in VS Code still resolves my module relative to my ~ directory. The same happened with addPyFile: at first it complained that it couldn't find my_module.py, so I had to change it to

spark.sparkContext.addPyFile('projects/project_name/my_module.py')

But running on EMR then created another problem. It complained:

"No module named 'projects'"

I think this is because the Spark worker nodes can't find "projects". When I ran my notebook in the local browser without

spark.sparkContext.addPyFile('my_module.py')

it gave me the error:

"No module named 'my_module'"

I took this to mean that only my EMR master instance could see my_module, not the other nodes. Adding addPyFile('my_module.py') fixed that problem. But I still can't get it to work from the VS Code Jupyter server.
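One thing that might be worth trying is sketched below. It assumes the "No module named 'projects'" error appears because the driver imports the module under the dotted name projects.project_name.my_module, while addPyFile makes the file importable on the workers only as the top-level name my_module. The helper import_from_dir is hypothetical, and the directory ~/projects/project_name is a guess based on the paths above. The question reports that sys.path.append alone did not help, so this is something to test, not a confirmed fix.

```python
import importlib
import os
import sys


def import_from_dir(module_name, module_dir):
    """Import module_name as a top-level module from module_dir.

    Returns the imported module and the absolute path of its .py file,
    so the same path can be handed to addPyFile.
    """
    module_dir = os.path.abspath(os.path.expanduser(module_dir))
    if module_dir not in sys.path:
        # Insert at the front so this directory is searched first.
        sys.path.insert(0, module_dir)
    # Clear the import system's cached directory listings.
    importlib.invalidate_caches()
    module = importlib.import_module(module_name)
    return module, os.path.join(module_dir, module_name + ".py")


# Hypothetical usage in the notebook (directory is an assumption):
# my_module, module_path = import_from_dir("my_module", "~/projects/project_name")
# spark.sparkContext.addPyFile(module_path)  # absolute path, independent of the cwd
```

Importing under the same top-level name on the driver and on the workers means no __init__.py files should be needed, and passing an absolute path to addPyFile removes the dependence on the server's working directory.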

Ideally, I would like to put my notebook in my ~ directory and get it to run without putting __init__.py in the projects and project_name folders.

Can anyone shed some light on this? Any help is greatly appreciated!

I ran into a similar problem.

My solution: I started the Jupyter notebook server from the terminal of the VS Code remote window rather than from another terminal window, then copied the URL it printed into the Jupyter Server setting. It works perfectly.

It seems that VS Code cannot correctly determine the file directory from the URL of a Jupyter server started outside the current remote window.
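To check whether the working directory really differs between the two setups, a small diagnostic cell like the following could be run once from the browser and once from VS Code. The function name describe_import_context is made up for this sketch; it only reports what the kernel sees and does not change anything.

```python
import importlib.util
import os
import sys


def describe_import_context(module_name):
    """Report the kernel's working directory and where module_name would load from."""
    try:
        spec = importlib.util.find_spec(module_name)
    except (ImportError, ValueError):
        # Raised for a missing parent package or a module without a usable spec.
        spec = None
    return {
        "cwd": os.getcwd(),
        "first_sys_path_entries": sys.path[:3],
        "module_origin": spec.origin if spec is not None else None,
    }


print(describe_import_context("my_module"))
```

If "module_origin" is None in VS Code but a file path in the browser, the kernel is not searching the notebook's folder.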
