
How to configure an alerting policy for a failed Dataproc Batch?

I want to alert on the failure of any serverless Dataproc job. I think I may need to create a log-based metric and then an alerting policy based on that metric.

I tried creating an alerting policy with the filter below:

      filter = "metric.type=\"logging.googleapis.com/log_entry_count\" resource.type=\"cloud_dataproc_batch\" metric.label.\"severity\"=\"ERROR\""

I was expecting an alert to trigger upon failure, but this metric does not seem to be active.

I tried creating Dataproc batch jobs with both the standard and a custom procedure, following this public documentation: Run an Apache Spark batch workload.

At the step where you create the Dataproc batch, change the value of the Arguments field to 0 instead of 1000, so that the job fails and an ERROR entry is written to Cloud Logging.
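If you prefer to reproduce this failing batch as code rather than through the console, a minimal Terraform sketch might look like the following. This is an illustration under assumptions: the google_dataproc_batch resource requires a reasonably recent hashicorp/google provider, and the batch_id is a hypothetical name.

# A deliberately failing serverless Spark batch: SparkPi parallelizes
# over the number of slices given as its argument, so "0" makes the
# job fail and write ERROR entries to Cloud Logging.
resource "google_dataproc_batch" "failing_spark_pi" {
  batch_id = "batch-spark-pi-fail"   # hypothetical name
  location = "us-central1"

  spark_batch {
    main_class    = "org.apache.spark.examples.SparkPi"
    jar_file_uris = ["file:///usr/lib/spark/examples/jars/spark-examples.jar"]
    args          = ["0"]            # use ["1000"] for a successful run
  }
}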

In Cloud Logging, I then used the filter below:

resource.type="audited_resource"
resource.labels.method="google.cloud.dataproc.v1.BatchController.CreateBatch"
resource.labels.service="dataproc.googleapis.com"
severity = "ERROR"

and it successfully returned the audited_resource details from Cloud Logging with severity: "ERROR":

{
  insertId: "efuxrvd7fs2"
  logName: "projects/t**h-********ra-350512/logs/cloudaudit.googleapis.com%2Factivity"
  operation: {…}
  protoPayload: {
    @type: "type.googleapis.com/google.cloud.audit.AuditLog"
    authenticationInfo: {…}
    methodName: "google.cloud.dataproc.v1.BatchController.CreateBatch"
    requestMetadata: {…}
    resourceName: "projects/t**h-********ra-350512/locations/us-central1/batches/batch-4e0c"
    serviceName: "dataproc.googleapis.com"
    status: {…}
  }
  receiveTimestamp: "2022-11-14T23:11:31.514920377Z"
  resource: {…}
  severity: "ERROR"
  timestamp: "2022-11-14T23:11:31.488430Z"
}

I also tried removing the severity = "ERROR" line to check the severity: "NOTICE" entries in Cloud Logging:

resource.type="audited_resource"
resource.labels.method="google.cloud.dataproc.v1.BatchController.CreateBatch"
resource.labels.service="dataproc.googleapis.com"

Example Output:

{
  insertId: "-1xjqtwdo1kq"
  logName: "projects/t**h-*******rra-350512/logs/cloudaudit.googleapis.com%2Factivity"
  operation: {…}
  protoPayload: {
    @type: "type.googleapis.com/google.cloud.audit.AuditLog"
    authenticationInfo: {…}
    authorizationInfo: […]
    methodName: "google.cloud.dataproc.v1.BatchController.CreateBatch"
    request: {…}
    requestMetadata: {…}
    resourceLocation: {…}
    resourceName: "projects/t**h-j*******a-350512/locations/us-central1/batches/batch-4e0c"
    serviceName: "dataproc.googleapis.com"
    status: {}
  }
  receiveTimestamp: "2022-11-14T23:06:27.771245352Z"
  resource: {…}
  severity: "NOTICE"
  timestamp: "2022-11-14T23:06:26.339799Z"
}
You can create a custom log-based metric that filters your Dataproc job errors, then create an alerting policy on that log-based metric, for example with Terraform (see: alerting policy log based metric terraform).

For your log-based metric, you have to add your filter, for example:

"filter": "resource.type=\"cloud_dataproc_cluster\" AND severity=ERROR AND jsonPayload.message =~ \"mytext.*\"",
