
AWS Glue and Python Integration

I have a data normalization process that exists in Python but now needs to scale. This process currently runs via a job-specific configuration file containing a list of transform functions that need to be applied to a table of data for that job. The transform functions are mutually exclusive and can be applied in any order. All transform functions live in a library and are only imported and applied to the data when they are listed in the job-specific configuration file. Different jobs will have different required functions listed in their configuration, but all functions exist in the library.
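For concreteness, the existing pattern might look roughly like the sketch below. The module name transforms_lib, the file job_config.json, and the helper apply_transforms are hypothetical stand-ins for whatever the real library and config actually look like.

```python
# Minimal sketch of the config-driven process described above.
# transforms_lib, job_config.json, and apply_transforms are hypothetical names.
import json
import importlib

import pandas as pd


def apply_transforms(df: pd.DataFrame, config_path: str) -> pd.DataFrame:
    """Apply every transform function listed in the job config, in order."""
    with open(config_path) as f:
        config = json.load(f)

    lib = importlib.import_module("transforms_lib")
    for name in config["transforms"]:
        func = getattr(lib, name)  # only functions listed in the config are used
        df = func(df)
    return df
```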

In the most general sense, how might a process like this be handled by AWS Glue? I don't need a technical example as much as a high-level overview. Simply looking to be aware of some options. Thanks!

The single most important thing you need to consider when using AWS Glue is that it is a serverless, Spark-based environment with extensions. That means you will need to adapt your script to be PySpark-like. If you are OK with that, then you can use external Python libraries by following the instructions in the AWS Glue documentation.
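As a rough illustration of what "PySpark-like" means here, a Glue job script could read the same job config and apply the listed transforms to a Spark DataFrame. This is only a sketch under assumptions: the Data Catalog database and table names, the --CONFIG job argument, the output S3 path, and the transforms_lib module (shipped to the job as an extra Python file and rewritten to operate on DataFrames) are all hypothetical.

```python
# Sketch of a Glue job script; catalog names, job arguments, and the
# transforms_lib dependency are placeholders, not a definitive setup.
import sys
import json
import importlib

from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv, ["JOB_NAME", "CONFIG"])
glue_context = GlueContext(SparkContext.getOrCreate())

# Load the source table through the Glue Data Catalog and convert it to a
# Spark DataFrame so the transform functions can operate on it.
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="my_database", table_name="my_table"
)
df = dyf.toDF()

# Apply the transforms listed in the job-specific config (passed here as a
# JSON string in the --CONFIG job argument), in the order given.
config = json.loads(args["CONFIG"])
lib = importlib.import_module("transforms_lib")
for name in config["transforms"]:
    df = getattr(lib, name)(df)

# Write the normalized result back out (the S3 path is a placeholder).
out = DynamicFrame.fromDF(df, glue_context, "normalized")
glue_context.write_dynamic_frame.from_options(
    frame=out,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/normalized/"},
    format="parquet",
)
```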

If you already have your scripts running and you don't feel like using Spark, you can always consider AWS Data Pipeline. It's a service that can run data transforms in more ways than just Spark. On the downside, AWS Data Pipeline is task-driven, not data-driven, which means no catalog or schema management.

How to use AWS Data Pipeline with Python is not obvious from the documentation, but the process is basically staging a shell file in S3 with the instructions to set up your Python environment and invoke the script. Then you configure scheduling for the pipeline, and AWS takes care of starting the virtual machines whenever needed and stopping them afterwards. There is a good post on Stack Overflow about this.
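A rough sketch of that staging step is below; the bucket name, object key, and the contents of the bootstrap script are placeholders, and the pipeline itself would still need to be defined separately (for example with a ShellCommandActivity pointing at this file).

```python
# Sketch: stage a bootstrap shell script to S3 for AWS Data Pipeline to run.
# Bucket, key, and script body are hypothetical placeholders.
import boto3

bootstrap = """#!/bin/bash
set -e
# Set up the Python environment, then run the normalization job.
pip install --user -r requirements.txt
python normalize.py --config job_config.json
"""

s3 = boto3.client("s3")
s3.put_object(
    Bucket="my-pipeline-bucket",
    Key="scripts/bootstrap.sh",
    Body=bootstrap.encode("utf-8"),
)
```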
