
How to install packages on EMR

I created a cluster on AWS with Jupyter and Python 3 installed. I can type code in the cells, and I found that numpy is installed: after import numpy as np, I can use the functions in this package. However, pandas is missing. So in the next cell I typed !pip install pandas, and it displayed:

Requirement already satisfied: pandas in /mnt/usrmoved/local/lib64/python2.7/site-packages
Requirement already satisfied: pytz>=2011k in /mnt/usrmoved/local/lib/python2.7/site-packages (from pandas)
Requirement already satisfied: numpy>=1.7.0 in /mnt/usrmoved/local/lib64/python2.7/site-packages (from pandas)
Requirement already satisfied: python-dateutil in /mnt/usrmoved/local/lib/python2.7/site-packages (from pandas)
Requirement already satisfied: six>=1.5 in /mnt/usrmoved/local/lib/python2.7/site-packages (from python-dateutil->pandas)

I thought it had been installed successfully, but when I typed import pandas as pd in the next cell, it gave me an error:

---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-8-af55e7023913> in <module>()
----> 1 import pandas as pd

ImportError: No module named 'pandas'

In general, how should we install Python packages on EMR?

On my laptop, I always run "!pip install package" in Jupyter and it works. Why does it not work in Jupyter on EMR?

I tried installing Python packages using pip install, but I got pip: command not found. So I used pip3 instead of pip, and it worked.

Using EMR 5.30.1
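
The pip output in the question lists python2.7 site-packages while the notebook kernel is Python 3, which suggests that pip and the kernel belong to different interpreters. A minimal sketch, assuming the kernel's interpreter has the pip module available: build the install command from sys.executable so the package lands in the interpreter the notebook is actually running. The helper names pip_install_command and install are hypothetical, and this only affects the node where the kernel runs, not the rest of the cluster.

```python
import subprocess
import sys


def pip_install_command(package, python=sys.executable):
    """Build a pip command bound to a specific interpreter (hypothetical helper)."""
    # "python -m pip" runs the pip that belongs to `python`,
    # instead of whichever pip executable happens to be first on PATH.
    return [python, "-m", "pip", "install", package]


def install(package):
    """Install `package` into the interpreter running this kernel."""
    # Raises CalledProcessError if pip exits with a non-zero status.
    subprocess.check_call(pip_install_command(package))


if __name__ == "__main__":
    # Show which interpreter the kernel uses and the command that would run.
    print(sys.version_info[:2])
    print(sys.executable)
    print(pip_install_command("pandas"))
```

Calling install("pandas") in a cell and then import pandas as pd in the next one should use the same interpreter. Depending on how the cluster is set up, the install may need elevated permissions or pip's --user option.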

The conventional way to install Python packages on EMR is to specify the required packages at cluster creation using a bootstrap action.

This method ensures the packages are installed on all nodes, not just the driver.

aws emr create-cluster \
--name 'test python packages' \
--release-label emr-5.20.0 \
--region us-east-1 \
--use-default-roles \
--instance-type m4.large \
--instance-count 2 \
--bootstrap-actions \
    Path="s3://your-bucket/python-modules.sh",Name='Install Python Modules'

The python-modules.sh script would contain the commands to install the Python packages. For example:

#!/bin/sh

# Install needed packages
sudo pip install pandas
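
Once the cluster is up, a quick check in a notebook cell can confirm that the bootstrap action installed the packages for the interpreter the kernel uses. This is a minimal sketch using only the standard library; missing_packages is a hypothetical helper and expects top-level module names. It checks only the node where it runs.

```python
import importlib.util
import sys


def missing_packages(names):
    """Return the top-level module names that this interpreter cannot find."""
    # find_spec returns None when a top-level module is not importable.
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Report the interpreter in use and any packages it cannot import.
    print(sys.executable)
    print(missing_packages(["numpy", "pandas"]))
```

If pandas still shows up as missing, the bootstrap script may have installed it for a different Python version than the one the kernel runs.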

AWS documentation
