
How to import Spark packages in AWS Glue?

I would like to use the GraphFrames package. If I were to run pyspark locally, I would use the command:

~/hadoop/spark-2.3.1-bin-hadoop2.7/bin/pyspark --packages graphframes:graphframes:0.6.0-spark2.3-s_2.11

But how would I run an AWS Glue script with this package? I found nothing in the documentation...

You can provide a path to extra libraries packaged into zip archives located in S3.

Please check out this doc for more details.
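
For example, here is a minimal boto3 sketch (job name, role, bucket, and paths are hypothetical placeholders) that attaches a zipped library to an existing job through the "--extra-py-files" default argument, which is the same setting the console exposes as the Python library path:

import boto3

glue = boto3.client("glue")

# Attach the zipped library so Glue adds it to the Python path at run time.
glue.update_job(
    JobName="my-graphframes-job",  # hypothetical job name
    JobUpdate={
        "Role": "arn:aws:iam::123456789012:role/MyGlueRole",       # hypothetical role
        "Command": {
            "Name": "glueetl",
            "ScriptLocation": "s3://my-bucket/scripts/my_job.py",  # hypothetical script
            "PythonVersion": "3",
        },
        "DefaultArguments": {
            "--extra-py-files": "s3://my-bucket/libs/graphframes.zip",  # hypothetical path
        },
    },
)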

It's possible to use graphframes as follows:

Download the graphframes Python library package file, e.g. from here. Unzip the .tar.gz and then re-archive it as a .zip. Put it somewhere in S3 that your Glue job has access to.

When setting up your Glue job:

  • Make sure that your Python Library Path references the zip file.
  • For job parameters, you need {"--conf": "spark.jars.packages=graphframes:graphframes:0.6.0-spark2.3-s_2.11"}
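
As a minimal sketch of what the Glue script itself might then look like (the tiny DataFrames below are throwaway examples, assuming the library path and --conf parameter above are in place):

from pyspark.context import SparkContext
from awsglue.context import GlueContext
from graphframes import GraphFrame  # resolves once the zip and --conf are configured

glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session

# Tiny in-memory graph just to confirm the package imports and runs.
vertices = spark.createDataFrame([("a", "Alice"), ("b", "Bob")], ["id", "name"])
edges = spark.createDataFrame([("a", "b", "follows")], ["src", "dst", "relationship"])

g = GraphFrame(vertices, edges)
g.inDegrees.show()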

Everyone looking for an answer, please read this comment.

In order to use an external package in AWS Glue PySpark or Python shell:

1) Clone the repo from the following URL: https://github.com/bhavintandel/py-packager/tree/master

git clone git@github.com:bhavintandel/py-packager.git

cd py-packager

2) Add your required package to requirements.txt. For example:

pygeohash

Update the version and project name in setup.py. For example:

VERSION = "0.1.0"

PACKAGE_NAME = "dependencies"

3) Run the following "command1" to create a .zip package for PySpark, OR "command2" to create an .egg file for the Python shell.

Command 1:

sudo make build_zip

Command 2:

sudo make bdist_egg

The above commands will generate the package in the dist folder.

4) Finally, upload this package from the dist directory to an S3 bucket. Then go to the AWS Glue job console, edit the job, find the script libraries option, click the folder icon for "Python library path", and select your S3 path.
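
If you prefer to script that upload instead of using the console, here is a small boto3 sketch (the local file name and bucket are hypothetical placeholders):

import boto3

s3 = boto3.client("s3")

# Upload whichever artifact you built above (.zip for PySpark, .egg for Python shell).
s3.upload_file(
    Filename="dist/dependencies-0.1.0.zip",  # hypothetical local file name
    Bucket="my-glue-artifacts-bucket",       # hypothetical bucket
    Key="glue-libs/dependencies-0.1.0.zip",
)
# Then point "Python library path" at s3://my-glue-artifacts-bucket/glue-libs/dependencies-0.1.0.zip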

Finally, use it in your Glue script:

import pygeohash as pgh
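
For example, a quick self-contained check (the coordinates are arbitrary) that the package resolves inside the job:

import pygeohash as pgh

# Encode an arbitrary coordinate pair to a geohash string, then decode it back.
geohash = pgh.encode(latitude=42.6, longitude=-5.6, precision=7)
print(geohash)              # a 7-character geohash string
print(pgh.decode(geohash))  # approximate (latitude, longitude)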

Done!
