
Install packages on EMR via bootstrap actions not working in Jupyter notebook

I have an EMR cluster running EMR-6.3.1. I am using the Python3 kernel.

I have a very simple bootstrap script in S3:

#!/bin/bash

sudo python3 -m pip install Cython==0.29.4 boto==2.49.0 boto3==1.18.50 numpy==1.19.5 pandas==1.3.2 pyarrow==5.0.0

These are the bootstrap logs:

+ sudo python3 -m pip install Cython==0.29.4 boto==2.49.0 boto3==1.18.50 numpy==1.19.5 pandas==1.3.2 pyarrow==5.0.0
WARNING: Running pip install with root privileges is generally not a good idea. Try `python3 -m pip install --user` instead.
  WARNING: The scripts cygdb, cython and cythonize are installed in '/usr/local/bin' which is not on PATH.
  Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
  WARNING: The scripts f2py, f2py3 and f2py3.7 are installed in '/usr/local/bin' which is not on PATH.
  Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
  WARNING: The script plasma_store is installed in '/usr/local/bin' which is not on PATH.
  Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.

Collecting Cython==0.29.4
  Downloading Cython-0.29.4-cp37-cp37m-manylinux1_x86_64.whl (2.1 MB)
Requirement already satisfied: boto==2.49.0 in /usr/local/lib/python3.7/site-packages (2.49.0)
Collecting boto3==1.18.50
  Downloading boto3-1.18.50-py3-none-any.whl (131 kB)
Collecting numpy==1.19.5
  Downloading numpy-1.19.5-cp37-cp37m-manylinux2010_x86_64.whl (14.8 MB)
Collecting pandas==1.3.2
  Downloading pandas-1.3.2-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (11.3 MB)
Collecting pyarrow==5.0.0
  Downloading pyarrow-5.0.0-cp37-cp37m-manylinux2014_x86_64.whl (23.6 MB)
Collecting s3transfer<0.6.0,>=0.5.0
  Downloading s3transfer-0.5.2-py3-none-any.whl (79 kB)
Collecting botocore<1.22.0,>=1.21.50
  Downloading botocore-1.21.65-py3-none-any.whl (8.0 MB)
Requirement already satisfied: jmespath<1.0.0,>=0.7.1 in /usr/local/lib/python3.7/site-packages (from boto3==1.18.50) (0.10.0)
Requirement already satisfied: pytz>=2017.3 in /usr/local/lib/python3.7/site-packages (from pandas==1.3.2) (2021.1)
Collecting python-dateutil>=2.7.3
  Downloading python_dateutil-2.8.2-py2.py3-none-any.whl (247 kB)
Collecting urllib3<1.27,>=1.25.4
  Downloading urllib3-1.26.13-py2.py3-none-any.whl (140 kB)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.7/site-packages (from python-dateutil>=2.7.3->pandas==1.3.2) (1.13.0)
Installing collected packages: Cython, python-dateutil, urllib3, botocore, s3transfer, boto3, numpy, pandas, pyarrow
Successfully installed Cython-0.29.4 boto3-1.18.50 botocore-1.21.65 numpy-1.19.5 pandas-1.3.2 pyarrow-5.0.0 python-dateutil-2.8.2 s3transfer-0.5.2 urllib3-1.26.13

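When I import pandas from the notebook, I see the wrong version, 1.2.3. In addition, pyarrow cannot be imported.

I have printed the import path of pandas, the Python interpreter found on the PATH, and sys.path.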

import os
import pandas
import sys
print(sys.path)
print(pandas.__version__)
print(os.path.abspath(pandas.__file__))
print(os.popen('echo $PYTHONPATH').read())
print(os.popen('which python3').read())

# sys.path.append('/usr/local/lib64/python3.7/site-packages') # if I add this, pyarrow can import
import pyarrow

['/', '/emr/notebook-env/lib/python37.zip', '/emr/notebook-env/lib/python3.7', '/emr/notebook-env/lib/python3.7/lib-dynload', '', '/emr/notebook-env/lib/python3.7/site-packages', '/emr/notebook-env/lib/python3.7/site-packages/awseditorssparkmonitoringwidget-1.0-py3.7.egg', '/emr/notebook-env/lib/python3.7/site-packages/IPython/extensions', '/home/emr-notebook/.ipython']
1.2.3
/emr/notebook-env/lib/python3.7/site-packages/pandas/__init__.py


/usr/bin/python3

---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input-2-aea9862499ce> in <module>
      9 
     10 # sys.path.append('/usr/local/lib64/python3.7/site-packages') # if I add this, pyarrow can import
---> 11 import pyarrow

ModuleNotFoundError: No module named 'pyarrow'

I found that if I add /usr/local/lib64/python3.7/site-packages to sys.path, I can import pyarrow. That seems like an improvement, but the wrong version of pandas is still imported.
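The output above suggests the notebook kernel is running from /emr/notebook-env rather than from the /usr/bin/python3 that the bootstrap script installed into. A minimal sketch for confirming which interpreter the kernel uses and where a module would be loaded from, without actually importing it; the helper name `locate` is hypothetical:

```python
import importlib.util
import sys


def locate(module_name):
    """Return the file a top-level module would be loaded from, or None if it is not found."""
    spec = importlib.util.find_spec(module_name)
    if spec is None:
        return None
    return spec.origin


# The interpreter actually running this kernel (may differ from `which python3`).
print("kernel interpreter:", sys.executable)

for name in ("pandas", "pyarrow"):
    print(name, "->", locate(name))
```

If `sys.executable` points into /emr/notebook-env, packages installed with `sudo python3 -m pip install` would not be expected to show up in the kernel unless their directory is on sys.path.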

I have tried:

  • SSHing into the master node and working through the configuration there.
  • sudo python3 -m pip install --user...
  • export PYTHONPATH=/usr/local/lib64/python3.7/site-packages && sudo python3 -m pip install...
  • sudo pip3 install --upgrade setuptools && sudo python3 -m pip install...
  • Using the pyspark kernel and running sc.install_pypi_package("pandas==1.3.2")

Any help is appreciated. Thank you.

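Bootstrap actions on EMR are not the last thing that happens before the cluster reaches the waiting state and EMR steps begin to run.

On my EMR cluster, I found that at least the packages below are logged as installed after the bootstrap actions have run. I had an issue where numpy was not being upgraded.

Python packages installed after bootstrap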

2022-12-10 00:10:28,250 INFO main: Installing packages took 1 minute 3 seconds 451 milliseconds:

[emr-scripts, emr-s3-select, aws-sagemaker-spark-sdk, python27-numpy, python27-sagemaker_pyspark, python37-numpy, python37-sagemaker_pyspark, emr-ddb, hadoop-yarn-nodemanager, docker, hadoop-yarn, spark-yarn-shuffle, bigtop-utils, cloudwatch-sink, hadoop, hadoop-lzo, emr-goodies, emrfs, hadoop-mapreduce, hadoop-hdfs, R-core, aws-hm-client, emr-kinesis, hadoop-hdfs-datanode, spark-datanucleus]

To work around this, run the install as your first cluster step:

...
Steps: [

        {
            'Name': 'Install Pandas',
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar":
                "s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar",
                "Args": [
                    "s3://bucket/prefix/install_packages.sh"
                ]
            }
        },
]

Or, if you want to use command-runner:

        {
            "Name": "Install Pandas",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": [
                    "bash",
                    "-c",
                    "aws s3 cp s3://bucket/prefix/install.sh .; chmod +x install.sh; ./install.sh; rm install.sh"
                ]
            }
        }
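If the cluster is already running, the same command-runner step could also be submitted from Python. This is only a sketch: it assumes boto3 is installed and AWS credentials are configured, and the function names, the region default, and the cluster ID in the final comment are placeholders rather than values from this thread.

```python
def make_install_step(script_uri, name="Install Pandas"):
    """Build a command-runner step that downloads and runs an install script."""
    command = (
        f"aws s3 cp {script_uri} install.sh; "
        "chmod +x install.sh; ./install.sh; rm install.sh"
    )
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["bash", "-c", command],
        },
    }


def submit_step(cluster_id, step, region_name="us-east-1"):
    """Submit the step to a running cluster (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder above works without it

    emr = boto3.client("emr", region_name=region_name)
    response = emr.add_job_flow_steps(JobFlowId=cluster_id, Steps=[step])
    return response["StepIds"]


step = make_install_step("s3://bucket/prefix/install.sh")
print(step["HadoopJarStep"]["Args"][-1])
# To submit (placeholder cluster ID): submit_step("j-XXXXXXXXXXXXX", step)
```

Keeping the step builder separate from the submit call means the generated command can be inspected before anything is sent to the cluster.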

In my example, my install.sh file looks like this:

#!/bin/bash
set -x

sudo pip3 freeze # for debugging to view the previous package versions
sudo pip3 uninstall numpy -y -v
sudo yum install python3-devel -y
sudo pip3 install boto3==1.26.26 -v
sudo pip3 install numpy==1.21.6 -v
sudo pip3 install pandas==1.3.5 -v
sudo pip3 freeze # for debugging to view the post-install package versions
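The two `pip3 freeze` calls make it possible to compare what ended up installed against the intended pins. A rough sketch of that comparison, assuming the freeze output has been captured as text; the function names are hypothetical, the sample output is made up, and only `name==version` lines are parsed:

```python
def parse_freeze(text):
    """Parse `pip freeze` output into {lowercased name: version}, skipping other line forms."""
    versions = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "==" not in line:
            continue
        name, _, version = line.partition("==")
        versions[name.strip().lower()] = version.strip()
    return versions


def find_mismatches(expected, installed):
    """Return {name: (wanted, found)} for pins that are missing or at another version."""
    mismatches = {}
    for name, wanted in expected.items():
        found = installed.get(name.lower())
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


expected = {"boto3": "1.26.26", "numpy": "1.21.6", "pandas": "1.3.5"}
sample = "boto3==1.26.26\nnumpy==1.19.5\nsix==1.13.0\n"  # made-up sample output
print(find_mismatches(expected, parse_freeze(sample)))
# prints {'numpy': ('1.21.6', '1.19.5'), 'pandas': ('1.3.5', None)}
```

A `None` in the second position means the package did not appear in the freeze output at all.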
