

Is there a way to publish custom metrics from AWS Glue jobs?

I'm using an AWS Glue job to move and transform data across S3 buckets, and I'd like to build custom accumulators to monitor the number of rows that I'm receiving and sending, along with other custom metrics. What is the best way to monitor these metrics? According to this document: https://docs.aws.amazon.com/glue/latest/dg/monitoring-awsglue-with-cloudwatch-metrics.html I can keep track of general metrics on my Glue job, but there doesn't seem to be a good way to send custom metrics through CloudWatch.

Consider enabling continuous logging on your AWS Glue job. This will allow you to do custom logging via CloudWatch. Custom logging can include information such as row counts.

More specifically:

  1. Enable continuous logging for your Glue job.
  2. Add logger = glueContext.get_logger() at the beginning of your Glue job.
  3. Add logger.info("Custom logging message that will be sent to CloudWatch") wherever you want to log information to CloudWatch. For example, if I have a data frame named df, I can log its row count to CloudWatch by adding logger.info("Row count of df " + str(df.count())). A minimal end-to-end sketch follows this list.
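
Putting the steps together, a sketch of a minimal Glue script with custom logging might look like this (it assumes continuous logging is enabled on the job; the S3 path and DataFrame are illustrative, not from the original question):

import sys

from awsglue.context import GlueContext
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext.getOrCreate())
logger = glueContext.get_logger()

# Illustrative read; replace with your own source.
df = glueContext.spark_session.read.json("s3://your-bucket/input/")

# With continuous logging enabled, this message is sent to CloudWatch.
logger.info("Row count of df " + str(df.count()))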

Your log messages will be located under the CloudWatch log group /aws-glue/jobs/logs-v2, in the log stream named glue_run_id-driver.

You can also reference the "Logging Application-Specific Messages Using the Custom Script Logger" section of the AWS documentation Enabling Continuous Logging for AWS Glue Jobs for more information on application-specific logging.

I have done lots of similar projects like this. Each micro batch can be:

  1. a file or a group of files
  2. a time interval of data from an API
  3. a partition of records from a database
  4. etc.

Your use case can be broken down into four questions:

  1. given a bunch of input, how do you define a task_id?
  2. how do you want to define the metrics for your task? You need to define a simple dictionary structure for this metrics data (see the sketch after this list)
  3. find a backend data store to store the metrics data
  4. find a way to query the metrics data
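
For questions 1 and 2, here is a minimal sketch. The hashing scheme and the metric field names are illustrative assumptions, not part of any particular library:

import hashlib

def make_task_id(input_files: list) -> str:
    # Hypothetical helper: derive a deterministic task_id by hashing the
    # sorted input S3 keys, so the same input always maps to the same task.
    digest = hashlib.sha256("\n".join(sorted(input_files)).encode("utf-8"))
    return digest.hexdigest()[:16]

# Illustrative metrics structure; use whatever fields matter to you.
your_metrics = {
    "rows_received": 0,
    "rows_sent": 0,
}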

In some business use cases, you also need to store status information to track each input: did it succeed? fail? is it in progress? stuck? You may also want to control retries and concurrency (to avoid multiple workers working on the same input).

DynamoDB is the perfect backend for this type of use case. It is a super fast, no-ops, pay-as-you-go, automatically scaling key-value store.
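
If you want to roll this yourself with plain boto3 rather than a library, a minimal sketch might look like this. The table name, key schema, and attribute names are assumptions for illustration, and the table must already exist with a string partition key task_id:

import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("glue-job-tracker")  # hypothetical table name

def save_metrics(task_id: str, status: str, metrics: dict) -> None:
    # Store the task status and its metrics as one item keyed by task_id.
    table.put_item(Item={"task_id": task_id, "status": status, **metrics})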

There's a Python library that implements this pattern: https://github.com/MacHu-GWU/pynamodb_mate-project/blob/master/examples/patterns/status-tracker.ipynb

Here's an example:

Put your Glue ETL job's main logic in a function:

def glue_job() -> dict:
    # your ETL logic goes here
    ...
    return your_metrics

Given an input, calculate the task_id identifier; then you just need:

tracker = Tracker.new(task_id)

# start the job (in this example it will succeed)
with tracker.start_job():
    # do some work
    your_metrics = glue_job()
    # save your metrics in DynamoDB
    tracker.set_data(your_metrics)
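
To query the metrics afterwards (question 4 above), you can read items back by key. A sketch using the hypothetical boto3 table from the earlier example, rather than the library's own query API:

import boto3

table = boto3.resource("dynamodb").Table("glue-job-tracker")  # hypothetical

# Fetch the status and metrics for one task by its key.
response = table.get_item(Key={"task_id": task_id})
item = response.get("Item")
if item:
    print(item["status"], item.get("rows_received"))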
