
Is there a way to publish custom metrics from AWS Glue jobs?

I'm using an AWS Glue job to move and transform data across S3 buckets, and I'd like to build custom accumulators to monitor the number of rows that I'm receiving and sending, along with other custom metrics. What is the best way to monitor these metrics? According to this document: https://docs.aws.amazon.com/glue/latest/dg/monitoring-awsglue-with-cloudwatch-metrics.html I can keep track of general metrics on my Glue job, but there doesn't seem to be a good way to send custom metrics through CloudWatch.

Consider enabling continuous logging on your AWS Glue job. This will allow you to do custom logging via CloudWatch. Custom logging can include information such as row counts.

More specifically

  1. Enable continuous logging for your Glue job
  2. Add logger = glueContext.get_logger() at the beginning of your Glue job
  3. Add logger.info("Custom logging message that will be sent to CloudWatch") wherever you want to log information to CloudWatch. For example, if I have a data frame named df, I could log its row count to CloudWatch by adding logger.info("Row count of df " + str(df.count())), as in the sketch after this list
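
For reference, here is a minimal sketch of what such a job script could look like, assuming continuous logging is turned on for the job; the S3 path is a placeholder for your own source data:

import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext.getOrCreate())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# the custom script logger writes to the continuous-logging log streams
logger = glueContext.get_logger()

# placeholder source path; replace with your own bucket/prefix
df = glueContext.spark_session.read.json("s3://your-source-bucket/input/")
logger.info("Row count of df " + str(df.count()))

job.commit()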

Your log messages will be located under the CloudWatch log group /aws-glue/jobs/logs-v2, in the log stream named glue_run_id-driver.

You can also reference the "Logging Application-Specific Messages Using the Custom Script Logger" section of the AWS documentation page Enabling Continuous Logging for AWS Glue Jobs for more information on application-specific logging.

I have done lots of similar projects like this. Each micro batch can be:

  1. a file or a bunch of files
  2. a time interval of data from an API
  3. a partition of records from a database
  4. etc...

Your use case can be broken down into the following questions (a small sketch of the first two follows the list):

  1. given a bunch of input, how do you define a task_id
  2. how do you want to define the metrics for your task; you need a simple dictionary structure for this metrics data
  3. find a backend data store to store the metrics data
  4. find a way to query the metrics data
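
As a rough illustration of the first two points (the id scheme and the metric names here are just assumptions, not part of any library):

import hashlib

def make_task_id(s3_uri: str) -> str:
    # derive a deterministic task_id from the input location
    return hashlib.md5(s3_uri.encode("utf-8")).hexdigest()

task_id = make_task_id("s3://your-source-bucket/input/2023-01-01.json")

# a simple dictionary structure for the metrics of one micro batch
your_metrics = {
    "rows_received": 10000,
    "rows_sent": 9876,
    "duration_seconds": 42,
}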

In some business use cases, you also need to store status information to track each input: did it succeed? fail? is it in progress? is it stuck? You may also want to control retries and concurrency (avoid multiple workers working on the same input).

DynamoDB is the perfect backend for this type of use case. It is a super fast, pay-as-you-go, automatically scaling key-value store with no operational overhead.
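
If you don't want to pull in a library, a plain boto3 sketch of storing one task's metrics and status could look like this, reusing the task_id and your_metrics from the sketch above (the table name and attribute names are assumptions; the table would use task_id as its partition key):

import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("glue-job-tracker")  # hypothetical table, partition key: task_id

# store the metrics together with a status flag for the task
table.put_item(
    Item={
        "task_id": task_id,
        "status": "succeeded",   # or "failed", "in_progress", ...
        "metrics": your_metrics, # the dictionary defined above
    }
)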

There's a Python library that implements this pattern: https://github.com/MacHu-GWU/pynamodb_mate-project/blob/master/examples/patterns/status-tracker.ipynb

Here's an example:

Put your Glue ETL job's main logic in a function:

def glue_job() -> dict:
    # do the actual ETL work here and collect metrics along the way
    ...
    # return the metrics dictionary so the tracker can store it
    return your_metrics

Given an input, calculate the task_id identifier; then you just need:

tracker = Tracker.new(task_id)

# start the job, it will succeed
with tracker.start_job():
    # do some work
    your_metrics = glue_job()
    # save your metrics in dynamodb
    tracker.set_data(your_metrics)
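
Point 4 from the list above, querying the metrics back for a given input, is then just a key lookup in DynamoDB; with plain boto3 against the hypothetical glue-job-tracker table sketched earlier it would look roughly like:

# fetch the stored status and metrics for one task
response = table.get_item(Key={"task_id": task_id})
item = response.get("Item")
if item is not None:
    print(item["status"], item["metrics"])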
