Running Jupyter PySpark notebook in EMR, module not found, although it is installed

I am trying to run a Jupyter notebook in EMR. The notebook uses the PySpark kernel. All packages needed by the notebook are installed via bootstrap actions, but when the notebook is run, it fails with an import error:

An error was encountered:
No module named 'xxxxx'
Traceback (most recent call last):
ModuleNotFoundError: No module named 'xxxxx'

How can I tell Jupyter to use the packages installed on the cluster? By the way, "the cluster" consists of only a master node. My guess is that Jupyter is using its own virtual environment or something similar, and that is why it doesn't see the installed packages.

OK, after some investigation I learned the following.

To modify the Python packages installed in, and recognized by, the EMR Notebook PySpark kernel, packages should be installed with:

sudo pip3 install $package

The important bit here is sudo pip3.
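
As a rough sketch, if the packages are installed through a bootstrap action (as in my setup), the script could look something like this; the package names are placeholders, not my actual requirements:

#!/bin/bash
# Bootstrap action: install Python packages for the EMR Notebook PySpark kernel.
# sudo pip3 installs into the cluster-wide Python 3 that the PySpark kernel picks up.
set -e

sudo pip3 install boto3 pandas   # placeholder package names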

On the other hand, to modify the Python packages recognized by the EMR Notebook Python3 kernel, this approach should be used:

# Init and activate conda environment,
# in order to use its pip to install packages.
/emr/notebook-env/bin/conda init
source /home/hadoop/.bashrc

# Conda should be on the path now.
conda activate

sudo /emr/notebook-env/bin/pip3 install $package
...

The trick was to initialize the Conda environment (which otherwise seemed to be in some read-only state), then activate it and use the pip installation that belongs to that environment.
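
A quick way to double-check where each install landed, assuming the same paths as above (pip3 show is a standard pip command):

# Package visible to the PySpark kernel (cluster-wide Python 3)?
pip3 show $package

# Package visible to the Python3 kernel (notebook conda environment)?
/emr/notebook-env/bin/pip3 show $package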
