Is there a way to publish custom metrics from AWS Glue jobs?
I'm using an AWS Glue job to move and transform data across S3 buckets, and I'd like to build custom accumulators to monitor the number of rows that I'm receiving and sending, along with other custom metrics. What is the best way to monitor these metrics?

According to this document: https://docs.aws.amazon.com/glue/latest/dg/monitoring-awsglue-with-cloudwatch-metrics.html I can keep track of general metrics on my Glue job, but there doesn't seem to be a good way to send custom metrics through CloudWatch.
Consider enabling continuous logging on your AWS Glue job. This will allow you to do custom logging via CloudWatch. Custom logging can include information such as row count.
More specifically, add

logger = glueContext.get_logger()

at the beginning of your Glue job, and add

logger.info("Custom logging message that will be sent to CloudWatch")

where you want to log information to CloudWatch.
For example, if I have a data frame named df, I could log its row count to CloudWatch by adding

logger.info("Row count of df " + str(df.count()))

Your log messages will be located under the CloudWatch log group /aws-glue/jobs/logs-v2, in the log stream named glue_run_id-driver.
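Putting the snippets above together: the `row_count_message` helper below is a hypothetical convenience (not part of any Glue API) that just builds the message string; only `glueContext.get_logger()` and `logger.info()` come from Glue itself, and those lines are commented out because they only run inside a Glue job.

```python
# Hypothetical helper for building a consistent row-count log message;
# the name and format are illustrative, not part of the Glue API.
def row_count_message(frame_name: str, count: int) -> str:
    return "Row count of " + frame_name + " " + str(count)

# Inside a real Glue job (not runnable locally) you would write:
# logger = glueContext.get_logger()
# logger.info(row_count_message("df", df.count()))
```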
You can also reference the "Logging Application-Specific Messages Using the Custom Script Logger" section of the AWS documentation, Enabling Continuous Logging for AWS Glue Jobs, for more information on application-specific logging.
I have done lots of similar projects like this. Your use case can be broken down into three parts:

1. Each micro batch can be identified by a task_id.
2. For your task, you need to define a simple dictionary structure for the metrics data.
3. In some business use cases, you also need to store status information to track each input: did it succeed? fail? is it in progress? stuck? You may also want retry control and concurrency control (avoiding multiple workers working on the same input).
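The three parts above can be sketched as a single tracking record per micro batch. The field names and status codes here are illustrative assumptions, not the schema of any particular library:

```python
from enum import IntEnum

class Status(IntEnum):
    # Illustrative status codes; the gaps leave room for extra states.
    IN_PROGRESS = 0
    FAILED = 10
    SUCCEEDED = 30

def new_task_record(task_id: str) -> dict:
    """Build a fresh tracking record for one micro-batch input."""
    return {
        "task_id": task_id,               # unique id, e.g. the S3 key
        "status": int(Status.IN_PROGRESS),
        "retry": 0,                       # incremented on each failed attempt
        "metrics": {},                    # e.g. {"rows_in": 0, "rows_out": 0}
    }
```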
DynamoDB is a perfect backend for this type of use case. It is a super fast, no-ops, pay-as-you-go, automatically scaling key-value store.
There's a Python library that implements this pattern: https://github.com/MacHu-GWU/pynamodb_mate-project/blob/master/examples/patterns/status-tracker.ipynb

Here's an example:
def glue_job() -> dict:
    ...
    return your_metrics

Given an input, calculate the task_id identifier; then you just need:
tracker = Tracker.new(task_id)
# start the job, it will succeed
with tracker.start_job():
# do some work
your_metrics = glue_job()
# save your metrics in dynamodb
tracker.set_data(your_metrics)
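The real `Tracker` persists state in DynamoDB; the minimal in-memory sketch below only imitates the shape of the pattern so the usage snippet above can be followed end to end. Only `new`, `start_job`, and `set_data` come from the example; everything else is an assumption.

```python
import contextlib

class Tracker:
    """In-memory stand-in for a DynamoDB-backed status tracker (sketch only)."""
    _store = {}  # task_id -> {"status": ..., "data": ...}

    def __init__(self, task_id):
        self.task_id = task_id

    @classmethod
    def new(cls, task_id):
        # In the real pattern this writes an "in progress" item to DynamoDB.
        cls._store[task_id] = {"status": "in_progress", "data": None}
        return cls(task_id)

    @contextlib.contextmanager
    def start_job(self):
        # Mark the task failed if the body raises, succeeded otherwise.
        try:
            yield self
        except Exception:
            self._store[self.task_id]["status"] = "failed"
            raise
        else:
            self._store[self.task_id]["status"] = "succeeded"

    def set_data(self, data):
        # In the real pattern this saves the metrics dict to the DynamoDB item.
        self._store[self.task_id]["data"] = data
```

With this stub, the usage snippet above runs unchanged and leaves the record marked succeeded with the metrics attached.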