
Using Pandas in AWS Glue Python Shell Jobs

The AWS documentation (https://docs.aws.amazon.com/glue/latest/dg/add-job-python.html) mentions that:

The environment for running a Python shell job supports the following libraries:

...

pandas (required to be installed via the python setuptools configuration, setup.py)

But it does not mention how to do the installation.

How can I use pandas in an AWS Glue Python Shell job?

Just to clarify Sandeep's answer, here is what worked for me:

1/ Ignore the AWS doc

2/ Create a setup.py file containing:

from setuptools import setup

setup(name="pandasmodule",
        version="0.1",
        packages=[],
        install_requires=['pandas==0.25.1']
    )

3/ Run this command in the folder containing the file:

python setup.py bdist_wheel

4/ Upload the .whl file to S3

5/ Set the "Python lib path" of your Glue ETL job to the S3 path of the .whl file

You can now use "import pandas as pd" in your Glue ETL job.
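Steps 3/ and 4/ can also be scripted. The following is only a sketch under assumptions, not part of the original answer: it picks the newest .whl that "python setup.py bdist_wheel" left in the dist/ folder and uploads it through a boto3 S3 client that you pass in. The helper names (newest_wheel, upload_wheel), the bucket "my-bucket" and the prefix "glue/libs" are hypothetical; upload_file(Filename, Bucket, Key) is the regular boto3 S3 client method.

```python
from pathlib import Path


def newest_wheel(dist_dir="dist"):
    """Return the most recently modified .whl file in dist_dir."""
    wheels = sorted(Path(dist_dir).glob("*.whl"), key=lambda p: p.stat().st_mtime)
    if not wheels:
        raise FileNotFoundError(f"no .whl file found in {dist_dir!r}")
    return wheels[-1]


def upload_wheel(s3_client, wheel_path, bucket, prefix=""):
    """Upload the wheel and return the s3:// URI to paste into "Python lib path"."""
    wheel_path = Path(wheel_path)
    prefix = prefix.strip("/")
    key = f"{prefix}/{wheel_path.name}" if prefix else wheel_path.name
    s3_client.upload_file(str(wheel_path), bucket, key)
    return f"s3://{bucket}/{key}"


# Hypothetical usage (needs boto3 and configured AWS credentials):
#   import boto3
#   print(upload_wheel(boto3.client("s3"), newest_wheel(), "my-bucket", "glue/libs"))
```

The S3 client is passed in as an argument so the helper can be tried with a stub object before pointing it at a real bucket.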

  1. Go to https://docs.aws.amazon.com/glue/latest/dg/add-job-python.html#create-python-extra-library and check the section "To create a Python .egg or .whl file" for how to create a setup file for a Python shell job.
  2. In the setup.py file, add the line install_requires=['pandas==0.25.1']:

 setup(name="<module name>", version="0.1", packages=['<package name if any or ignore>'], install_requires=['pandas==0.25.1'])

I also wrote a small shell script that deploys a Python shell job without the manual steps: it creates the egg file, uploads it to S3 and deploys via CloudFormation, all automatically. You can find the code at https://github.com/fatangare/aws-python-shell-deploy

Using the Glue Python Shell, the following script works directly for pandas:

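# setup.py (presumably built into a wheel separately, as in the steps above,
# rather than executed as part of the job script)
from setuptools import setup

setup(name="pandasmodule",
        version="0.1",
        packages=[],
        install_requires=['pandas==0.25.1']
    )

# job script: use pandas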
import numpy as np
import pandas as pd

s = pd.Series([1, 3, 5, np.nan, 6, 8])

print(s)

No need to do anything, just import pandas and start using it.
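To see which pandas version (if any) a given job environment actually provides, a check like the one below can go at the top of the job script. It is a minimal sketch, assuming the job runs Python 3.8 or later so that importlib.metadata is available; installed_version is a hypothetical helper name.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Log what the environment provides before relying on it.
print("pandas:", installed_version("pandas"))
print("numpy:", installed_version("numpy"))
```

If the helper returns None for pandas, the wheel-based installation described in the other answers is still needed.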

AWS Glue 2.0 supports pandas 1.0.1: https://docs.aws.amazon.com/glue/latest/dg/reduced-start-times-spark-etl-jobs.html

So in your script you can simply write: import pandas. If you want to use another Python module that Glue does not provide, download its .whl or .zip file, store it in S3, and put its S3 path in the job's "Python library path"; during the job run, Glue will pip install the module.
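When the job is created from code rather than in the console, the library path can be passed as a job argument. The sketch below only builds the keyword arguments for boto3's glue.create_job call. It assumes that the console's "Python library path" field corresponds to the --extra-py-files job parameter (check this against the Glue documentation); the function name, job name, role and S3 paths are hypothetical.

```python
def python_shell_job_config(name, role, script_s3_path, extra_py_files):
    """Build keyword arguments for boto3's glue.create_job (Python shell job)."""
    return {
        "Name": name,
        "Role": role,
        "Command": {
            "Name": "pythonshell",  # job type: Python shell rather than Spark ETL
            "ScriptLocation": script_s3_path,
            "PythonVersion": "3",
        },
        "DefaultArguments": {
            # Assumption: comma-separated S3 paths of the .whl/.egg files to install
            "--extra-py-files": ",".join(extra_py_files),
        },
    }


config = python_shell_job_config(
    "pandas-shell-job",
    "MyGlueRole",
    "s3://my-bucket/scripts/job.py",
    ["s3://my-bucket/glue/libs/pandasmodule-0.1-py3-none-any.whl"],
)
print(config["DefaultArguments"]["--extra-py-files"])

# Hypothetical usage (needs boto3 and AWS credentials):
#   import boto3
#   boto3.client("glue").create_job(**config)
```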

