Is there a way to have the output from Dataproc Spark jobs sent to Google Cloud Logging? As explained in the Dataproc docs, the output from the job driver (the master for a Spark job) is available under Dataproc->Jobs in the console. There are two reasons I would like to have the logs in Cloud Logging as well:
1. Currently the only output from Dataproc that shows up in Cloud Logging is log items from yarn-yarn-nodemanager-* and container_*.stderr. Output from my application code is shown in Dataproc->Jobs but not in Cloud Logging.
2. The job output in the console is only the output from the Spark master, not the executors.
tl;dr
This is not natively supported today, but it will be natively supported in a future version of Cloud Dataproc. That said, there is a manual workaround in the interim.
Workaround
Cloud Dataproc clusters use fluentd to collect and forward logs to Cloud Logging. The fluentd configuration is why you see some logs forwarded and not others. Therefore, the simple workaround (until Cloud Dataproc has support for job details in Cloud Logging) is to modify the fluentd configuration. The configuration file for fluentd on a cluster is at:
/etc/google-fluentd/google-fluentd.conf
The easiest way to gather additional details is to add the relevant Spark log files to the list of files that fluentd already collects (line 56 has the file list on my cluster). Once you edit the configuration, you'll need to restart the google-fluentd service:
/etc/init.d/google-fluentd restart
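For illustration, here is a sketch of an extra source block you could append to google-fluentd.conf to tail additional log files. The path, pos_file location, and tag below are assumptions, not values from the stock Dataproc configuration; adjust them to wherever Spark writes its logs on your cluster:

```
# Hypothetical additional input for /etc/google-fluentd/google-fluentd.conf.
# Adjust path/pos_file/tag to match your cluster's Spark log locations.
<source>
  @type tail
  format none
  path /var/log/spark/*.log
  pos_file /var/lib/google-fluentd/spark-logs.pos
  read_from_head true
  tag raw.tail.spark
</source>
```

After restarting google-fluentd, entries from the tailed files should appear in Cloud Logging under the tag you chose.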
Finally, depending on your needs, you may not need to do this on every node of your cluster. Based on your use case, it sounds like changing just your master node would be enough.
You can use the Dataproc initialization action for Stackdriver for this:
gcloud dataproc clusters create <CLUSTER_NAME> \
--initialization-actions gs://<GCS_BUCKET>/stackdriver.sh \
--scopes https://www.googleapis.com/auth/monitoring.write
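Separately, since the question notes that container_*.stderr is already forwarded to Cloud Logging, one application-side workaround is to print structured lines to stderr from your Spark code, on the driver and executors alike. A minimal sketch; the helper name and JSON fields are my own invention, not a Dataproc or Spark API:

```python
import json
import sys
import time

def log_event(message, severity="INFO", **fields):
    """Serialize a log record as JSON and write it to stderr.

    Hypothetical helper: Dataproc forwards container_*.stderr to Cloud
    Logging, so anything printed to stderr from executor code ends up there.
    Returns the serialized line for convenience.
    """
    record = {"timestamp": time.time(), "severity": severity, "message": message}
    record.update(fields)
    line = json.dumps(record)
    print(line, file=sys.stderr)
    return line

# Example: call from inside a map/foreachPartition on an executor.
log_event("partition processed", rows=1024)
```

Emitting one JSON object per line keeps the records easy to filter once they land in Cloud Logging.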