
How can I use an external Python library in AWS Glue?

First Stack Overflow question here. Hope I do this correctly:

I need to use an external Python library in AWS Glue. The library is called openpyxl.

I followed these directions: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-python-libraries.html

However, after saving my zip file in the correct S3 location and pointing my Glue job to that location, I'm not sure what to actually write in the script.

I tried the typical import openpyxl, but that just returns the following error:

ImportError: No module named openpyxl

Obviously I don't know what to do here. I'm also relatively new to programming, so I'm not sure if this is a noob question or what. Thanks in advance!

It depends on whether the job is Spark or Python Shell. For Spark you just need to zip the library; when you point the job to the library's S3 path, the job will import it. You just need to make sure that the zip contains this file: __init__.py

For example, for the library you are trying to import, if you download it from https://pypi.org/project/openpyxl/#files, you can zip the openpyxl folder found inside openpyxl-3.0.0.tar.gz and store that zip in S3.
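The zipping step can be sketched in Python as follows. This is only an illustration, not something from the Glue documentation: the function name zip_package and the file names in the usage note are assumptions. It keeps the package folder as the top-level entry of the archive, so the zip contains openpyxl/__init__.py as described above.

```python
import zipfile
from pathlib import Path


def zip_package(package_dir, zip_path):
    """Zip a package folder, keeping the folder itself as the top-level entry.

    Returns the list of archive member names that were written.
    """
    package = Path(package_dir).resolve()
    if not (package / "__init__.py").is_file():
        raise ValueError(f"{package} has no __init__.py, so it is not a package")

    written = []
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(package.rglob("*")):
            # Member names are relative to the parent, e.g. openpyxl/__init__.py
            relative = path.relative_to(package.parent)
            if path.is_file() and "__pycache__" not in relative.parts:
                archive.write(path, relative.as_posix())
                written.append(relative.as_posix())
    return written
```

Assuming the tarball was extracted into the current directory, a call such as zip_package("openpyxl-3.0.0/openpyxl", "openpyxl.zip") would produce the file to upload to S3.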


On the other hand, if it is a Python Shell job, a zip file will not work. You will need to create an egg file from the library. If you are using version openpyxl-3.0.0, you can download it from that same website, extract everything, and run the command python setup.py bdist_egg (use python3 instead of python if you are on Python 3).

This will create a dist folder containing the generated egg file. You just need to put that egg file in S3 and point the Glue job's Python library path to it.

If you already have the library and for some reason you don't have the setup.py, then you must create it in order to run the command that generates the egg file. You can find an example at http://www.blog.pythonlibrary.org/2012/07/12/python-101-easy_install-or-how-to-create-eggs/

You can now (as of Glue version 2) directly add external libraries using the --additional-python-modules parameter.

For example, to update or add a scikit-learn module, use the following key/value:

"--additional-python-modules", "scikit-learn==0.21.3"
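If you set job parameters from code rather than in the console, the same key/value can be built as a dictionary. The sketch below is only an illustration: the helper name additional_modules_argument is hypothetical, and pinning openpyxl to 3.0.0 is an assumption taken from the version discussed in this thread. The value is a comma-separated list, so several modules go into a single string.

```python
def additional_modules_argument(modules):
    """Build the job-parameter dict for --additional-python-modules.

    `modules` is an iterable of pip-style requirement strings.
    """
    cleaned = [m.strip() for m in modules if m.strip()]
    if not cleaned:
        raise ValueError("at least one module is required")
    # Glue expects one comma-separated string for this key
    return {"--additional-python-modules": ",".join(cleaned)}


args = additional_modules_argument(["scikit-learn==0.21.3", "openpyxl==3.0.0"])
print(args)
# With boto3, this dict would typically be passed as DefaultArguments=args
# when creating or updating the job (not executed here).
```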

More details can be found in the docs.

You can use the following boilerplate code to include extra files as well as external libraries: https://github.com/fatangare/aws-python-shell-deploy
