
How can I use an external Python library in AWS Glue?

First Stack Overflow question here. I hope I'm doing this correctly:

I need to use an external Python library in AWS Glue. The library is called openpyxl.

I followed these directions: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-python-libraries.html

However, after saving my zip file in the correct S3 location and pointing my Glue job to that location, I'm not sure what to actually write in the script.

I tried the usual import openpyxl, but that just returns the following error:

ImportError: No module named openpyxl

Obviously I don't know what to do here - also relatively new to programming so I'm not sure if this is a noob question or what. Thanks in advance!

It depends on whether the job is a Spark job or a Python Shell job. For Spark, you just need to zip the library; once you point the job to the zip's S3 path, the script can import it. Make sure the zipped package folder contains an __init__.py file.

For example, for the library you are trying to import, you can download openpyxl-3.0.0.tar.gz from https://pypi.org/project/openpyxl/#files, extract it, zip the openpyxl folder inside it, and store that zip in S3.
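The zipping step can also be scripted. Below is a minimal sketch, assuming the tarball has already been extracted locally; the function name zip_package and the paths in the comment are only illustrative. It keeps the package folder at the root of the archive, so the zip contains openpyxl/__init__.py rather than loose files.

```python
import zipfile
from pathlib import Path


def zip_package(package_dir, zip_path):
    """Zip a package folder so that the folder itself sits at the archive root."""
    package_dir = Path(package_dir).resolve()
    if not (package_dir / "__init__.py").is_file():
        raise ValueError(f"{package_dir} has no __init__.py, so it is not an importable package")

    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(package_dir.rglob("*")):
            # Entry names are relative to the parent folder, e.g. openpyxl/cell/__init__.py
            rel = path.relative_to(package_dir.parent)
            if path.is_file() and "__pycache__" not in rel.parts:
                zf.write(path, rel.as_posix())
    return zip_path


# Hypothetical usage, with paths that depend on where you extracted the tarball:
# zip_package("openpyxl-3.0.0/openpyxl", "openpyxl.zip")
```

After uploading the resulting zip to S3, its path is what goes into the job's Python library path, and a plain import openpyxl in the script should then resolve.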


On the other hand, if it is a Python Shell job, a zip file will not work. You will need to create an egg file from the library. If you are using openpyxl-3.0.0, you can download it from that same website, extract everything, and run the command python setup.py bdist_egg (use python3 instead of python if you are on Python 3).

This generates an egg file inside a dist folder, which the command also creates. You just need to put that egg file in S3 and point the Glue job's Python library path to it.

If you already have the library but for some reason don't have a setup.py, you must create one in order to run the command that generates the egg file. You can find an example at http://www.blog.pythonlibrary.org/2012/07/12/python-101-easy_install-or-how-to-create-eggs/ .
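To give a rough idea of what such a setup.py looks like, here is a minimal sketch that writes one into a library folder if none exists. The helper name write_minimal_setup, the template, and the default version are my own assumptions, not something taken from the linked post; it relies on setuptools being installed when you later run the bdist_egg command.

```python
from pathlib import Path

# Bare-bones setup.py: just enough metadata for "python setup.py bdist_egg".
SETUP_TEMPLATE = '''from setuptools import setup, find_packages

setup(
    name="{name}",
    version="{version}",
    packages=find_packages(),
)
'''


def write_minimal_setup(project_dir, name, version="0.1.0"):
    """Create setup.py next to the package folder unless one already exists."""
    setup_path = Path(project_dir) / "setup.py"
    if not setup_path.exists():
        # Never overwrite a setup.py that the library already ships with
        setup_path.write_text(SETUP_TEMPLATE.format(name=name, version=version))
    return setup_path


# Hypothetical usage: the folder "mylib_src" contains the package folder "mylib".
# write_minimal_setup("mylib_src", "mylib")
```

Running python setup.py bdist_egg from that folder should then produce the egg under dist, as described above.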

As of Glue version 2.0, you can add external libraries directly using the --additional-python-modules job parameter.

For example, to add or update the scikit-learn module, use the following key/value pair:

"--additional-python-modules", "scikit-learn==0.21.3".

More details can be found in the docs.
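If you create jobs from code rather than the console, the same key/value pair goes into the job's default arguments. Here is a minimal sketch, assuming boto3 is installed and AWS credentials are configured; the two function names and every value passed in (job name, role, script path) are placeholders. Multiple modules are passed as one comma-separated string.

```python
def additional_modules_argument(requirements):
    """Build the job-argument dict from a list of pip-style requirement strings."""
    cleaned = [r.strip() for r in requirements if r.strip()]
    if not cleaned:
        raise ValueError("at least one module is required")
    # Glue takes a single comma-separated string for this parameter
    return {"--additional-python-modules": ",".join(cleaned)}


def create_job_with_modules(job_name, role_arn, script_s3_path, requirements):
    """Create a Glue 2.0 Spark job that installs the given modules at startup."""
    import boto3  # imported lazily so the helper above works without boto3

    glue = boto3.client("glue")
    return glue.create_job(
        Name=job_name,
        Role=role_arn,
        GlueVersion="2.0",
        Command={
            "Name": "glueetl",
            "ScriptLocation": script_s3_path,
            "PythonVersion": "3",
        },
        DefaultArguments=additional_modules_argument(requirements),
    )
```

For the case in the question, additional_modules_argument(["openpyxl==3.0.0"]) gives the dict to pass as DefaultArguments.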

You may use the following boilerplate code to deploy extra files as well as external libraries: https://github.com/fatangare/aws-python-shell-deploy
