
Use AWS Glue Python with NumPy and Pandas Python Packages

What is the easiest way to use packages such as NumPy and Pandas within the new ETL tool on AWS called Glue? I have a completed Python script that utilizes NumPy and Pandas which I would like to run in AWS Glue.

You can check the latest Python packages installed by running this script as a Glue job:

import logging
import pip

logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)

if __name__ == '__main__':
    # Note: pip's private API prints the package list to stdout; the call
    # returns an exit code, which is what actually gets logged here.
    logger.info(pip._internal.main(['list']))
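
Because `pip._internal` is a private API that has changed between pip releases, a more robust sketch (not from the original answer) shells out to pip instead:

```python
import subprocess
import sys

# Run "pip list" through the interpreter executing this job, so the
# output reflects the environment the Glue job actually sees.
result = subprocess.run(
    [sys.executable, "-m", "pip", "list"],
    capture_output=True,
    text=True,
    check=True,
)
print(result.stdout)
```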

As of 30-Jun-2020, Glue has these Python packages pre-installed, so numpy and pandas are covered.

awscli 1.16.242
boto3 1.9.203
botocore 1.12.232
certifi 2020.4.5.1
chardet 3.0.4
colorama 0.3.9
docutils 0.15.2
idna 2.8
jmespath 0.9.4
numpy 1.16.2
pandas 0.24.2
pip 20.0.2
pyasn1 0.4.8
PyGreSQL 5.0.6
python-dateutil 2.8.1
pytz 2019.3
PyYAML 5.2
requests 2.22.0
rsa 3.4.2
s3transfer 0.2.1
scikit-learn 0.20.3
scipy 1.2.1
setuptools 45.1.0
six 1.14.0
urllib3 1.25.8
virtualenv 16.7.9
wheel 0.34.2

You can install additional packages in Glue Python shell jobs if they are present in the requirements.txt used to build the attached .whl. The .whl file gets collected and installed before your script is kicked off. I would also suggest looking into SageMaker Processing, which is easier for Python-based jobs. Unlike the serverless instances used for Glue Python shell jobs, you are not limited to 16 GB of memory there.

I think the current answer is: you cannot. According to the AWS Glue documentation:

Only pure Python libraries can be used. Libraries that rely on C extensions, such as the pandas Python Data Analysis Library, are not yet supported.

But even when I tried to include a library written in pure Python in S3, the Glue job failed because of an HDFS permission problem. If you find a way to solve this, please let me know as well.

If you don't have pure Python libraries and still want to use one, you can use the script below in your Glue code:

import os
import site
from importlib import reload  # reload is not a builtin in Python 3
from setuptools.command import easy_install

install_path = os.environ['GLUE_INSTALLATION']
easy_install.main(["--install-dir", install_path, "<library-name>"])
reload(site)


import <installed library>

There is an update:

...You can now use Python shell jobs... ...Python shell jobs in AWS Glue support scripts that are compatible with Python 2.7 and come pre-loaded with libraries such as Boto3, NumPy, SciPy, pandas, and others.

https://aws.amazon.com/about-aws/whats-new/2019/01/introducing-python-shell-jobs-in-aws-glue/

When you click Run job, there is a Job parameters (optional) section that is collapsed by default. Expanding it reveals the following options, which you can use to point to libraries saved in S3. This works for me:

Python library path

s3://bucket-name/folder-name/file-name

Dependent jars path

s3://bucket-name/folder-name/file-name

Referenced files path

s3://bucket-name/folder-name/file-name

The accepted answer is no longer true as of 2019.

awswrangler is what you need. It allows you to use pandas in Glue and Lambda.

https://github.com/awslabs/aws-data-wrangler

Install using an AWS Lambda Layer

https://aws-data-wrangler.readthedocs.io/en/latest/install.html#setting-up-lambda-layer

Example: Typical Pandas ETL

import pandas
import awswrangler as wr

df = pandas.read_...  # Read from anywhere

# Typical Pandas, Numpy or Pyarrow transformation HERE!

wr.pandas.to_parquet(  # Storing the data and metadata to Data Lake
    dataframe=df,
    database="database",
    path="s3://...",
    partition_cols=["col_name"],
)

AWS Glue version 2.0, released in August 2020, now has pandas and numpy installed by default. See https://docs.aws.amazon.com/glue/latest/dg/reduced-start-times-spark-etl-jobs.html#reduced-start-times-new-features for details.

If you go to edit a job (or when you create a new one), there is a collapsed optional section called "Script libraries and job parameters (optional)". In there, you can specify an S3 bucket for Python libraries (as well as other things). I haven't tried it out myself for that part yet, but I think that's what you are looking for.

As of now, you can use Python extension modules and libraries with your AWS Glue ETL scripts as long as they are written in pure Python. C libraries such as pandas are not supported at the present time, nor are extensions written in other languages.

AWS Glue library/dependency management is a little convoluted.

There are basically three ways to add required packages.

Approach 1

  1. Via the AWS console UI/job definition; a few steps to help:
    Action --> Edit Job

    then scroll all the way down and expand

    Security configuration, script libraries, and job parameters (optional)

    then add all your packages as .zip files under the Python library path (you need to upload your .zip files to S3, then specify the path)

    one catch here is that your zip file must contain __init__.py in the root folder

Also, if your package depends on another package, it will be very difficult to add those packages this way.
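
The zip layout described above can also be produced from code; a minimal sketch, where the `mymodule` package name is just a placeholder:

```python
import zipfile

# Glue expects the package directory, with its __init__.py, at the
# root of the archive. "mymodule" is a placeholder name.
with zipfile.ZipFile("mymodule.zip", "w") as zf:
    zf.writestr("mymodule/__init__.py", "def greet():\n    return 'hello'\n")

# Inspect the archive to confirm the layout.
with zipfile.ZipFile("mymodule.zip") as zf:
    names = zf.namelist()
print(names)  # ['mymodule/__init__.py']
```

After uploading `mymodule.zip` to S3 and pointing the Python library path at it, `import mymodule` should work inside the job.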

Approach 2

Programmatically installing your packages (the easy one).

Here is the path where you can install the required libraries to:

/home/spark/.local/lib/python3.7/site-packages/

Here is an example of installing an AWS package; I have installed the SageMaker package here:

import site
from importlib import reload
from setuptools.command import easy_install

# install_path = site.getsitepackages()[0]
install_path = '/home/spark/.local/lib/python3.7/site-packages/'
easy_install.main(["--install-dir", install_path, "https://files.pythonhosted.org/packages/60/c7/126ad8e7dfbffaf9a5384ca6123da85db6c7b4b4479440ce88c94d2bb23f/sagemaker-2.3.0.tar.gz"])
reload(site)

Approach 3 (suggested and clean)

Under the Security configuration, script libraries, and job parameters (optional) section, go to job parameters.

Add the required libraries with the --additional-python-modules parameter; you can specify as many packages as you need, separated by commas.
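
The same parameter can also be supplied when creating or updating the job from code (boto3's `glue.create_job`/`update_job` accept a `DefaultArguments` map); a sketch that only builds that payload, with illustrative module versions:

```python
import json

# DefaultArguments as it would be passed to glue.create_job() or
# update_job() via boto3; the module list and versions are examples,
# --additional-python-modules is the actual Glue parameter name.
default_arguments = {
    "--additional-python-modules": "pandas==1.2.4,awswrangler==2.4.0",
}

payload = json.dumps({"DefaultArguments": default_arguments})
print(payload)
```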

Happy to help.

Use Glue version 2 instead of version 3. Steps:

  1. Go to the Glue job and edit the script with the code below

Code:

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
import pandas as pd

args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)


excel_path= r"s3://input/employee.xlsx"
df_xl_op = pd.read_excel(excel_path,sheet_name = "Sheet1")
df=df_xl_op.applymap(str)
input_df = spark.createDataFrame(df)
input_df.printSchema()

job.commit()
  2. Save the script

  3. Go to Action - Edit Job - select Glue version 2 and set the key/value under security configuration:

    key: --additional-python-modules
    value: pandas==1.2.4,xlrd==1.2.0,numpy==1.20.1,fsspec==0.7.4

  4. Save and run the job

This will resolve your error and you will be able to read the Excel file using pandas.

If you want to integrate Python modules into your AWS Glue ETL job, you can. You can use whatever Python module you want, because Glue is essentially a serverless Python run environment. So all you need to do is package the modules that your script requires using pip install -t /path/to/your/directory, then upload them to your S3 bucket. While creating the AWS Glue job, after pointing to the S3 scripts and temp location, if you go to the advanced job parameters option, you will see a python_libraries option there. You can just point that to the Python module packages that you uploaded to S3.
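
The package-and-upload flow above can be sketched as follows; the pip and S3 steps are left as comments because they need network access and credentials, and the stub package below merely stands in for pip's output:

```python
import os
import shutil

# Step 1 (run locally, needs network):
#     pip install -t build/ <your-modules>
# Simulated here with a stub package so the zip step is runnable:
os.makedirs("build/mypkg", exist_ok=True)
with open("build/mypkg/__init__.py", "w") as f:
    f.write("VERSION = '0.1'\n")

# Step 2: zip the contents of build/ so packages sit at the archive root.
archive = shutil.make_archive(
    os.path.abspath("python_libraries"), "zip", root_dir="build"
)

# Step 3 (needs credentials): upload and reference it in the job, e.g.
#     aws s3 cp python_libraries.zip s3://your-bucket/libs/python_libraries.zip
print(archive)
```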

In order to install a specific version (for instance, for an AWS Glue Python job), navigate to a website with Python packages, for example the page of the package "pg8000": https://pypi.org/project/pg8000/1.12.5/#files

Then select an appropriate version, copy the link to the file, and paste it into the snippet below:

import os
import site
from importlib import reload  # reload is not a builtin in Python 3
from setuptools.command import easy_install

install_path = os.environ['GLUE_INSTALLATION']

easy_install.main(["--install-dir", install_path, "https://files.pythonhosted.org/packages/83/03/10902758730d5cc705c0d1dd47072b6216edc652bc2e63a078b58c0b32e6/pg8000-1.12.5.tar.gz"])
reload(site)
