
How to send JVM metrics of Spark to Prometheus in Kubernetes

Original post: 2020-06-16 21:37:31 · Tags: apache-spark, kubernetes, prometheus, spark-operator

I am using the Spark operator (https://github.com/GoogleCloudPlatform/spark-on-k8s-operator) to run Spark on Kubernetes.

I am trying to run a Java agent in the Spark driver and executor pods and send the metrics through a Kubernetes service to the Prometheus operator.

I am using this example: https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/master/examples/spark-pi-prometheus.yaml

The Java agent exposes the metrics on port 8090 for a short time (I can validate that with port-forwarding: kubectl port-forward <spark-driver-pod-name> 8090:8090). The service also exposes the metrics for a few minutes (I can validate that with port-forwarding: kubectl port-forward svc/<spark-service-name> 8090:8090).

Prometheus is able to register these pods' URLs as targets, but when it tries to scrape the metrics (a scrape runs every 30 seconds), the pod's URL is down.

How can I keep the Java agent JMX exporter running until the driver and executors have completed the job? Could anyone who has come across this scenario before guide or help me here?

1 Answer

Either Prometheus needs to scrape the metrics every 5 seconds (and even then the metrics may not be accurate), or you need to use Pushgateway, as described in this blog post (https://banzaicloud.com/blog/spark-monitoring/), to push the metrics to Prometheus.

Pushing the metrics to Prometheus (through Pushgateway) is a best practice for batch jobs. Having Prometheus pull the metrics is the best approach for long-running services (e.g. REST services).
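
To illustrate the push model, here is a minimal Python sketch of sending gauge values to a Pushgateway over HTTP. It is not the Spark metrics sink from the blog post above; it only shows the general shape of a push. The host name pushgateway:9091, the job name spark-pi, the instance label and the metric name and value are all assumptions chosen for illustration, and label values containing "/" (which need Pushgateway's base64 form) are not handled.

```python
import urllib.request
from urllib.parse import quote


def pushgateway_url(base_url, job, grouping=None):
    """Build a push URL of the form <base>/metrics/job/<job>/<label>/<value>..."""
    parts = [base_url.rstrip("/"), "metrics", "job", quote(job, safe="")]
    for label, value in (grouping or {}).items():
        # Assumption: values do not contain "/", which this sketch does not handle.
        parts += [quote(label, safe=""), quote(value, safe="")]
    return "/".join(parts)


def render_gauges(metrics):
    """Render {name: value} as gauges in the Prometheus text exposition format."""
    lines = []
    for name, value in metrics.items():
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


def push(base_url, job, metrics, grouping=None, timeout=5):
    """PUT the metrics, replacing the group identified by job + grouping labels."""
    request = urllib.request.Request(
        pushgateway_url(base_url, job, grouping),
        data=render_gauges(metrics).encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "text/plain; version=0.0.4"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


if __name__ == "__main__":
    # Hypothetical endpoint and values; adjust to your cluster.
    push(
        "http://pushgateway:9091",
        "spark-pi",
        {"jvm_threads_current": 42},
        grouping={"instance": "driver"},
    )
```

Prometheus then scrapes the Pushgateway on its normal interval, so the pushed values remain available after the driver and executor pods have exited.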