Installing packages on EMR via a bootstrap action does not work in Jupyter Notebook

Question

I have an EMR cluster using EMR-6.3.1. I am using the Python3 kernel.

I have a very simple bootstrap script in S3:

#!/bin/bash

sudo python3 -m pip install Cython==0.29.4 boto==2.49.0 boto3==1.18.50 numpy==1.19.5 pandas==1.3.2 pyarrow==5.0.0

These are the bootstrap logs:

+ sudo python3 -m pip install Cython==0.29.4 boto==2.49.0 boto3==1.18.50 numpy==1.19.5 pandas==1.3.2 pyarrow==5.0.0
WARNING: Running pip install with root privileges is generally not a good idea. Try `python3 -m pip install --user` instead.
  WARNING: The scripts cygdb, cython and cythonize are installed in '/usr/local/bin' which is not on PATH.
  Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
  WARNING: The scripts f2py, f2py3 and f2py3.7 are installed in '/usr/local/bin' which is not on PATH.
  Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
  WARNING: The script plasma_store is installed in '/usr/local/bin' which is not on PATH.
  Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.

and

Collecting Cython==0.29.4
  Downloading Cython-0.29.4-cp37-cp37m-manylinux1_x86_64.whl (2.1 MB)
Requirement already satisfied: boto==2.49.0 in /usr/local/lib/python3.7/site-packages (2.49.0)
Collecting boto3==1.18.50
  Downloading boto3-1.18.50-py3-none-any.whl (131 kB)
Collecting numpy==1.19.5
  Downloading numpy-1.19.5-cp37-cp37m-manylinux2010_x86_64.whl (14.8 MB)
Collecting pandas==1.3.2
  Downloading pandas-1.3.2-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (11.3 MB)
Collecting pyarrow==5.0.0
  Downloading pyarrow-5.0.0-cp37-cp37m-manylinux2014_x86_64.whl (23.6 MB)
Collecting s3transfer<0.6.0,>=0.5.0
  Downloading s3transfer-0.5.2-py3-none-any.whl (79 kB)
Collecting botocore<1.22.0,>=1.21.50
  Downloading botocore-1.21.65-py3-none-any.whl (8.0 MB)
Requirement already satisfied: jmespath<1.0.0,>=0.7.1 in /usr/local/lib/python3.7/site-packages (from boto3==1.18.50) (0.10.0)
Requirement already satisfied: pytz>=2017.3 in /usr/local/lib/python3.7/site-packages (from pandas==1.3.2) (2021.1)
Collecting python-dateutil>=2.7.3
  Downloading python_dateutil-2.8.2-py2.py3-none-any.whl (247 kB)
Collecting urllib3<1.27,>=1.25.4
  Downloading urllib3-1.26.13-py2.py3-none-any.whl (140 kB)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.7/site-packages (from python-dateutil>=2.7.3->pandas==1.3.2) (1.13.0)
Installing collected packages: Cython, python-dateutil, urllib3, botocore, s3transfer, boto3, numpy, pandas, pyarrow
Successfully installed Cython-0.29.4 boto3-1.18.50 botocore-1.21.65 numpy-1.19.5 pandas-1.3.2 pyarrow-5.0.0 python-dateutil-2.8.2 s3transfer-0.5.2 urllib3-1.26.13

From a notebook, importing pandas shows the wrong version, 1.2.3. Further, pyarrow fails to import.

I've printed the import path of pandas, which python3 is found on the PATH, and sys.path.

import os
import pandas
import sys
print(sys.path)
print(pandas.__version__)
print(os.path.abspath(pandas.__file__))
print(os.popen('echo $PYTHONPATH').read())
print(os.popen('which python3').read())

# sys.path.append('/usr/local/lib64/python3.7/site-packages') # if I add this, pyarrow can import
import pyarrow

['/', '/emr/notebook-env/lib/python37.zip', '/emr/notebook-env/lib/python3.7', '/emr/notebook-env/lib/python3.7/lib-dynload', '', '/emr/notebook-env/lib/python3.7/site-packages', '/emr/notebook-env/lib/python3.7/site-packages/awseditorssparkmonitoringwidget-1.0-py3.7.egg', '/emr/notebook-env/lib/python3.7/site-packages/IPython/extensions', '/home/emr-notebook/.ipython']
1.2.3
/emr/notebook-env/lib/python3.7/site-packages/pandas/__init__.py


/usr/bin/python3

---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input-2-aea9862499ce> in <module>
      9 
     10 # sys.path.append('/usr/local/lib64/python3.7/site-packages') # if I add this, pyarrow can import
---> 11 import pyarrow

ModuleNotFoundError: No module named 'pyarrow'

I found I can import pyarrow if I add /usr/local/lib64/python3.7/site-packages to sys.path. This seems like an improvement, but the wrong version of pandas is still imported.
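The printed sys.path suggests that the notebook kernel runs from an environment under /emr/notebook-env, while the bootstrap script installed into /usr/local/lib and /usr/local/lib64 for /usr/bin/python3. This is an inference from the output above, not something the post confirms. Note that `which python3` run through os.popen reports the interpreter the shell finds, which is not necessarily the one running the kernel. A minimal diagnostic sketch that asks the running interpreter directly (the helper name describe_environment is made up for illustration):

```python
import importlib.util
import sys


def describe_environment(module_names):
    """Report which interpreter is running and where each module would load from."""
    report = {
        "executable": sys.executable,  # interpreter running this kernel
        "prefix": sys.prefix,          # root of that interpreter's environment
        "modules": {},
    }
    for name in module_names:
        # find_spec returns None when a top-level module is not importable
        spec = importlib.util.find_spec(name)
        report["modules"][name] = spec.origin if spec is not None else None
    return report


if __name__ == "__main__":
    info = describe_environment(["pandas", "pyarrow"])
    print("interpreter:", info["executable"])
    print("prefix:     ", info["prefix"])
    for name, origin in info["modules"].items():
        print(f"{name}: {origin or 'not found on sys.path'}")
```

If the reported interpreter lives under /emr/notebook-env, packages installed with `sudo python3 -m pip install` would land in a different site-packages directory than the one the kernel reads. Whether the kernel's environment can be modified on EMR is not something this post establishes.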

I've tried:

SSH'ing into the master node and mucking with the configuration.
sudo python3 -m pip install --user...
export PYTHONPATH=/usr/local/lib64/python3.7/site-packages && sudo python3 -m pip install...
sudo pip3 install --upgrade setuptools && sudo python3 -m pip install...
Using a pyspark kernel and running sc.install_pypi_package("pandas==1.3.2")

Any help is appreciated. Thank you.

Answer 1

The bootstrap configuration on EMR is not the last thing that runs before the cluster reaches the WAITING state and EMR steps start running.

On my EMR cluster I found that at least the packages below were logged as installed after the bootstrap configuration ran. I was having issues with numpy not upgrading, and python37-numpy is in that list.

Python packages installed post bootstrap

2022-12-10 00:10:28,250 INFO main: Took 1 minute, 3 seconds and 451 milliseconds to install packages:

[emr-scripts, emr-s3-select, aws-sagemaker-spark-sdk, python27-numpy, python27-sagemaker_pyspark, python37-numpy, python37-sagemaker_pyspark, emr-ddb, hadoop-yarn-nodemanager, docker, hadoop-yarn, spark-yarn-shuffle, bigtop-utils, cloudwatch-sink, hadoop, hadoop-lzo, emr-goodies, emrfs, hadoop-mapreduce, hadoop-hdfs, R-core, aws-hm-client, emr-kinesis, hadoop-hdfs-datanode, spark-datanucleus]
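To check quickly whether a given Python package is part of that post-bootstrap install, the bracketed list can be pulled out of the log text. The sketch below assumes the log looks like the excerpt above (a single bracketed, comma-separated list). The function names are hypothetical helpers, not part of any EMR tooling.

```python
import re


def packages_from_log(log_text):
    """Return the package names in the first bracketed list found in log_text."""
    match = re.search(r"\[([^\]]*)\]", log_text)
    if match is None:
        return []
    return [name.strip() for name in match.group(1).split(",") if name.strip()]


def python_packages(names, prefix="python3"):
    """Keep only names starting with the prefix, such as python37-numpy."""
    return [name for name in names if name.startswith(prefix)]


if __name__ == "__main__":
    sample = (
        "2022-12-10 00:10:28,250 INFO main: Took 1 minute, 3 seconds and "
        "451 milliseconds to install packages: "
        "[emr-scripts, python27-numpy, python37-numpy, hadoop]"
    )
    print(python_packages(packages_from_log(sample)))
```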

A workaround is to do the installation in your first cluster step:

...
Steps: [
    {
        "Name": "Install Pandas",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar",
            "Args": [
                "s3://bucket/prefix/install_packages.sh"
            ]
        }
    },
]

Or, if you want to use command-runner:

{
    "Name": "Install Pandas",
    "ActionOnFailure": "CONTINUE",
    "HadoopJarStep": {
        "Jar": "command-runner.jar",
        "Args": [
            "bash",
            "-c",
            "aws s3 cp s3://bucket/prefix/install.sh .; chmod +x install.sh; ./install.sh; rm install.sh"
        ]
    }
}
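For a cluster that is already running, a step like the command-runner one can also be submitted from Python. The following is a rough sketch, not the answerer's own tooling: build_install_step and submit_step are names invented here, the region default is an assumption, and the S3 URI is the placeholder from the answer. The boto3 call used is the EMR client's add_job_flow_steps. The shell command removes the same file it downloaded.

```python
def build_install_step(script_s3_uri, name="Install Pandas"):
    """Build a command-runner step that downloads and runs an install script."""
    command = (
        f"aws s3 cp {script_s3_uri} install.sh; "
        "chmod +x install.sh; ./install.sh; rm install.sh"
    )
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["bash", "-c", command],
        },
    }


def submit_step(cluster_id, step, region_name="us-east-1"):
    """Submit the step to a running cluster; needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder works without boto3 installed

    emr = boto3.client("emr", region_name=region_name)
    response = emr.add_job_flow_steps(JobFlowId=cluster_id, Steps=[step])
    return response["StepIds"]


if __name__ == "__main__":
    # Placeholder URI from the answer; replace with a real script location.
    print(build_install_step("s3://bucket/prefix/install.sh"))
```

Keeping the step definition in a plain function makes it easy to inspect or unit test the dictionary before anything is sent to AWS.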

In my example, my install.sh file looks like this:

#!/bin/bash
set -x

sudo pip3 freeze # for debugging to view the previous package versions
sudo pip3 uninstall numpy -y -v
sudo yum install python3-devel -y
sudo pip3 install boto3==1.26.26 -v
sudo pip3 install numpy==1.21.6 -v
sudo pip3 install pandas==1.3.5 -v
sudo pip3 freeze # for debugging to view the post-install package versions
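The two pip3 freeze calls show versions in the step logs. The same check can be run from the interpreter that will actually import the packages, for example in a notebook cell. The sketch below is one way to do that under stated assumptions: installed_version and check_pins are hypothetical helper names, the pins are the versions from install.sh, and on Python 3.7 (where importlib.metadata is absent) it falls back to pkg_resources, assuming setuptools is installed.

```python
def installed_version(package):
    """Return the installed version of package, or None if it is not installed."""
    try:
        from importlib import metadata  # standard library on Python 3.8+
    except ImportError:
        # Python 3.7 fallback (assumption: setuptools provides pkg_resources)
        import pkg_resources

        try:
            return pkg_resources.get_distribution(package).version
        except pkg_resources.DistributionNotFound:
            return None
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_pins(pins, lookup=installed_version):
    """Return {package: (expected, found)} for every pin that does not match."""
    mismatches = {}
    for package, expected in pins.items():
        found = lookup(package)
        if found != expected:
            mismatches[package] = (expected, found)
    return mismatches


if __name__ == "__main__":
    pins = {"boto3": "1.26.26", "numpy": "1.21.6", "pandas": "1.3.5"}
    for package, (expected, found) in check_pins(pins).items():
        print(f"{package}: expected {expected}, found {found}")
```

An empty result means every pinned version matches in the environment that ran the check. A mismatch only in the notebook would point at the kernel using a different environment than the one the step installed into.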

Answer 1: 0, 2022-12-14 04:13:04
