
How to configure an alerting policy for a failed Dataproc Batch?

I want to alert on the failure of any serverless Dataproc job. I think I may need to create a log-based metric and then an alerting policy based on that metric.

I tried creating an alerting policy with the filter below:

      filter = "metric.type=\"logging.googleapis.com/log_entry_count\" resource.type=\"cloud_dataproc_batch\" metric.label.\"severity\"=\"ERROR\""

I was expecting an alert to trigger upon failure, but this metric does not seem to be active.

I tried creating Dataproc batch jobs with both the standard and a custom procedure, following this public documentation: Run an Apache Spark batch workload.

At the step where you create the Dataproc batch, change the value of the Arguments field to 0 instead of 1000, so that the job fails and an ERROR entry is written to Cloud Logging.
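If you prefer to reproduce this failing batch as code rather than through the console, a minimal Terraform sketch might look like the following. This is an illustration under assumptions: the google_dataproc_batch resource requires a reasonably recent hashicorp/google provider, and the batch_id is a hypothetical name.

# A deliberately failing serverless Spark batch: SparkPi parallelizes
# over the number of slices given as its argument, so "0" makes the
# job fail and write ERROR entries to Cloud Logging.
resource "google_dataproc_batch" "failing_spark_pi" {
  batch_id = "batch-spark-pi-fail"   # hypothetical name
  location = "us-central1"

  spark_batch {
    main_class    = "org.apache.spark.examples.SparkPi"
    jar_file_uris = ["file:///usr/lib/spark/examples/jars/spark-examples.jar"]
    args          = ["0"]            # use ["1000"] for a successful run
  }
}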

In Cloud Logging, I then used the filter below:

resource.type="audited_resource"
resource.labels.method="google.cloud.dataproc.v1.BatchController.CreateBatch"
resource.labels.service="dataproc.googleapis.com"
severity = "ERROR"

and it successfully returned the audited_resource details from Cloud Logging with severity: "ERROR":

{
  insertId: "efuxrvd7fs2"
  logName: "projects/t**h-********ra-350512/logs/cloudaudit.googleapis.com%2Factivity"
  operation: {…}
  protoPayload: {
    @type: "type.googleapis.com/google.cloud.audit.AuditLog"
    authenticationInfo: {…}
    methodName: "google.cloud.dataproc.v1.BatchController.CreateBatch"
    requestMetadata: {…}
    resourceName: "projects/t**h-********ra-350512/locations/us-central1/batches/batch-4e0c"
    serviceName: "dataproc.googleapis.com"
    status: {…}
  }
  receiveTimestamp: "2022-11-14T23:11:31.514920377Z"
  resource: {…}
  severity: "ERROR"
  timestamp: "2022-11-14T23:11:31.488430Z"
}

I also tried removing the severity = "ERROR" line to check the severity: "NOTICE" entries in Cloud Logging:

resource.type="audited_resource"
resource.labels.method="google.cloud.dataproc.v1.BatchController.CreateBatch"
resource.labels.service="dataproc.googleapis.com"

Example Output:

{
  insertId: "-1xjqtwdo1kq"
  logName: "projects/t**h-*******rra-350512/logs/cloudaudit.googleapis.com%2Factivity"
  operation: {…}
  protoPayload: {
    @type: "type.googleapis.com/google.cloud.audit.AuditLog"
    authenticationInfo: {…}
    authorizationInfo: […]
    methodName: "google.cloud.dataproc.v1.BatchController.CreateBatch"
    request: {…}
    requestMetadata: {…}
    resourceLocation: {…}
    resourceName: "projects/t**h-j*******a-350512/locations/us-central1/batches/batch-4e0c"
    serviceName: "dataproc.googleapis.com"
    status: {}
  }
  receiveTimestamp: "2022-11-14T23:06:27.771245352Z"
  resource: {…}
  severity: "NOTICE"
  timestamp: "2022-11-14T23:06:26.339799Z"
}
You can create a custom log-based metric that filters your Dataproc job errors, then create an alerting policy on that log-based metric, for example with Terraform (see: alerting policy log based metric terraform).

For your log-based metric, you have to add your filter, for example:

"filter": "resource.type=\"cloud_dataproc_cluster\" AND severity=ERROR AND jsonPayload.message =~ \"mytext.*\"",
