如何将flink任务或Back Pressure相关指标导出到prometheus？

Question

I followed the instructions Reporter to export flink metrics to prometheus, however it seems by default it only export job-manager related metrics to prometheus, see below: 我按照Reporter的说明将flink指标导出到prometheus，但默认情况下它似乎只将作业管理器相关的指标导出到prometheus，见下文：

Open http://localhost:9249/ , I just get the following info, no task or task manager related metrics found. 打开http：// localhost：9249 / ，我只是获取以下信息，没有找到任务或任务管理器相关的指标。

# HELP flink_jobmanager_Status_JVM_Memory_Mapped_MemoryUsed MemoryUsed (scope: jobmanager_Status_JVM_Memory_Mapped)
# TYPE flink_jobmanager_Status_JVM_Memory_Mapped_MemoryUsed gauge
flink_jobmanager_Status_JVM_Memory_Mapped_MemoryUsed{host="localhost",} 0.0
# HELP flink_jobmanager_Status_JVM_ClassLoader_ClassesUnloaded ClassesUnloaded (scope: jobmanager_Status_JVM_ClassLoader)
# TYPE flink_jobmanager_Status_JVM_ClassLoader_ClassesUnloaded gauge
flink_jobmanager_Status_JVM_ClassLoader_ClassesUnloaded{host="localhost",} 0.0
# HELP flink_jobmanager_Status_JVM_GarbageCollector_PS_Scavenge_Time Time (scope: jobmanager_Status_JVM_GarbageCollector_PS_Scavenge)
# TYPE flink_jobmanager_Status_JVM_GarbageCollector_PS_Scavenge_Time gauge
flink_jobmanager_Status_JVM_GarbageCollector_PS_Scavenge_Time{host="localhost",} 273.0
# HELP flink_jobmanager_job_lastCheckpointRestoreTimestamp lastCheckpointRestoreTimestamp (scope: jobmanager_job)
# TYPE flink_jobmanager_job_lastCheckpointRestoreTimestamp gauge
flink_jobmanager_job_lastCheckpointRestoreTimestamp{job_id="dfac65e575f318970e0225eab9688a2e",host="localhost",job_name="Popular_Places_to_Elasticsearch",} -1.0
# HELP flink_jobmanager_job_lastCheckpointAlignmentBuffered lastCheckpointAlignmentBuffered (scope: jobmanager_job)
# TYPE flink_jobmanager_job_lastCheckpointAlignmentBuffered gauge
flink_jobmanager_job_lastCheckpointAlignmentBuffered{job_id="dfac65e575f318970e0225eab9688a2e",host="localhost",job_name="Popular_Places_to_Elasticsearch",} 0.0
# HELP flink_jobmanager_job_lastCheckpointExternalPath lastCheckpointExternalPath (scope: jobmanager_job)
# TYPE flink_jobmanager_job_lastCheckpointExternalPath gauge
flink_jobmanager_job_lastCheckpointExternalPath{job_id="dfac65e575f318970e0225eab9688a2e",host="localhost",job_name="Popular_Places_to_Elasticsearch",} 0.0
# HELP flink_jobmanager_Status_JVM_Memory_Direct_TotalCapacity TotalCapacity (scope: jobmanager_Status_JVM_Memory_Direct)
# TYPE flink_jobmanager_Status_JVM_Memory_Direct_TotalCapacity gauge
flink_jobmanager_Status_JVM_Memory_Direct_TotalCapacity{host="localhost",} 2374599.0
# HELP flink_jobmanager_Status_JVM_Threads_Count Count (scope: jobmanager_Status_JVM_Threads)
# TYPE flink_jobmanager_Status_JVM_Threads_Count gauge
flink_jobmanager_Status_JVM_Threads_Count{host="localhost",} 47.0
# HELP flink_jobmanager_Status_JVM_Memory_Heap_Committed Committed (scope: jobmanager_Status_JVM_Memory_Heap)
# TYPE flink_jobmanager_Status_JVM_Memory_Heap_Committed gauge
flink_jobmanager_Status_JVM_Memory_Heap_Committed{host="localhost",} 1.058013184E9
# HELP flink_jobmanager_Status_JVM_Memory_NonHeap_Used Used (scope: jobmanager_Status_JVM_Memory_NonHeap)
# TYPE flink_jobmanager_Status_JVM_Memory_NonHeap_Used gauge
flink_jobmanager_Status_JVM_Memory_NonHeap_Used{host="localhost",} 7.02964E7
# HELP flink_jobmanager_job_restartingTime restartingTime (scope: jobmanager_job)
# TYPE flink_jobmanager_job_restartingTime gauge
flink_jobmanager_job_restartingTime{job_id="dfac65e575f318970e0225eab9688a2e",host="localhost",job_name="Popular_Places_to_Elasticsearch",} 0.0
# HELP flink_jobmanager_Status_JVM_GarbageCollector_PS_Scavenge_Count Count (scope: jobmanager_Status_JVM_GarbageCollector_PS_Scavenge)
# TYPE flink_jobmanager_Status_JVM_GarbageCollector_PS_Scavenge_Count gauge
flink_jobmanager_Status_JVM_GarbageCollector_PS_Scavenge_Count{host="localhost",} 24.0
# HELP flink_jobmanager_Status_JVM_Memory_NonHeap_Committed Committed (scope: jobmanager_Status_JVM_Memory_NonHeap)
# TYPE flink_jobmanager_Status_JVM_Memory_NonHeap_Committed gauge
flink_jobmanager_Status_JVM_Memory_NonHeap_Committed{host="localhost",} 7.2876032E7
# HELP flink_jobmanager_Status_JVM_GarbageCollector_PS_MarkSweep_Count Count (scope: jobmanager_Status_JVM_GarbageCollector_PS_MarkSweep)
# TYPE flink_jobmanager_Status_JVM_GarbageCollector_PS_MarkSweep_Count gauge
flink_jobmanager_Status_JVM_GarbageCollector_PS_MarkSweep_Count{host="localhost",} 2.0
# HELP flink_jobmanager_job_downtime downtime (scope: jobmanager_job)
# TYPE flink_jobmanager_job_downtime gauge
flink_jobmanager_job_downtime{job_id="dfac65e575f318970e0225eab9688a2e",host="localhost",job_name="Popular_Places_to_Elasticsearch",} 0.0
# HELP flink_jobmanager_job_numberOfFailedCheckpoints numberOfFailedCheckpoints (scope: jobmanager_job)
# TYPE flink_jobmanager_job_numberOfFailedCheckpoints gauge
flink_jobmanager_job_numberOfFailedCheckpoints{job_id="dfac65e575f318970e0225eab9688a2e",host="localhost",job_name="Popular_Places_to_Elasticsearch",} 0.0
# HELP flink_jobmanager_job_numberOfInProgressCheckpoints numberOfInProgressCheckpoints (scope: jobmanager_job)
# TYPE flink_jobmanager_job_numberOfInProgressCheckpoints gauge
flink_jobmanager_job_numberOfInProgressCheckpoints{job_id="dfac65e575f318970e0225eab9688a2e",host="localhost",job_name="Popular_Places_to_Elasticsearch",} 0.0
# HELP flink_jobmanager_job_numberOfCompletedCheckpoints numberOfCompletedCheckpoints (scope: jobmanager_job)
# TYPE flink_jobmanager_job_numberOfCompletedCheckpoints gauge
flink_jobmanager_job_numberOfCompletedCheckpoints{job_id="dfac65e575f318970e0225eab9688a2e",host="localhost",job_name="Popular_Places_to_Elasticsearch",} 961.0
# HELP flink_jobmanager_Status_JVM_GarbageCollector_PS_MarkSweep_Time Time (scope: jobmanager_Status_JVM_GarbageCollector_PS_MarkSweep)
# TYPE flink_jobmanager_Status_JVM_GarbageCollector_PS_MarkSweep_Time gauge
flink_jobmanager_Status_JVM_GarbageCollector_PS_MarkSweep_Time{host="localhost",} 110.0
# HELP flink_jobmanager_Status_JVM_Memory_Mapped_Count Count (scope: jobmanager_Status_JVM_Memory_Mapped)
# TYPE flink_jobmanager_Status_JVM_Memory_Mapped_Count gauge
flink_jobmanager_Status_JVM_Memory_Mapped_Count{host="localhost",} 0.0
# HELP flink_jobmanager_Status_JVM_CPU_Load Load (scope: jobmanager_Status_JVM_CPU)
# TYPE flink_jobmanager_Status_JVM_CPU_Load gauge
flink_jobmanager_Status_JVM_CPU_Load{host="localhost",} 0.0025814303680169446
# HELP flink_jobmanager_Status_JVM_ClassLoader_ClassesLoaded ClassesLoaded (scope: jobmanager_Status_JVM_ClassLoader)
# TYPE flink_jobmanager_Status_JVM_ClassLoader_ClassesLoaded gauge
flink_jobmanager_Status_JVM_ClassLoader_ClassesLoaded{host="localhost",} 7693.0
# HELP flink_jobmanager_Status_JVM_Memory_Heap_Max Max (scope: jobmanager_Status_JVM_Memory_Heap)
# TYPE flink_jobmanager_Status_JVM_Memory_Heap_Max gauge
flink_jobmanager_Status_JVM_Memory_Heap_Max{host="localhost",} 1.058013184E9
# HELP flink_jobmanager_job_uptime uptime (scope: jobmanager_job)
# TYPE flink_jobmanager_job_uptime gauge
flink_jobmanager_job_uptime{job_id="dfac65e575f318970e0225eab9688a2e",host="localhost",job_name="Popular_Places_to_Elasticsearch",} 4811388.0
# HELP flink_jobmanager_Status_JVM_CPU_Time Time (scope: jobmanager_Status_JVM_CPU)
# TYPE flink_jobmanager_Status_JVM_CPU_Time gauge
flink_jobmanager_Status_JVM_CPU_Time{host="localhost",} 1.044894698E11
# HELP flink_jobmanager_Status_JVM_Memory_Direct_Count Count (scope: jobmanager_Status_JVM_Memory_Direct)
# TYPE flink_jobmanager_Status_JVM_Memory_Direct_Count gauge
flink_jobmanager_Status_JVM_Memory_Direct_Count{host="localhost",} 60.0
# HELP flink_jobmanager_Status_JVM_Memory_Heap_Used Used (scope: jobmanager_Status_JVM_Memory_Heap)
# TYPE flink_jobmanager_Status_JVM_Memory_Heap_Used gauge
flink_jobmanager_Status_JVM_Memory_Heap_Used{host="localhost",} 2.15962464E8
# HELP flink_jobmanager_job_lastCheckpointDuration lastCheckpointDuration (scope: jobmanager_job)
# TYPE flink_jobmanager_job_lastCheckpointDuration gauge
flink_jobmanager_job_lastCheckpointDuration{job_id="dfac65e575f318970e0225eab9688a2e",host="localhost",job_name="Popular_Places_to_Elasticsearch",} 16.0
# HELP flink_jobmanager_Status_JVM_Memory_NonHeap_Max Max (scope: jobmanager_Status_JVM_Memory_NonHeap)
# TYPE flink_jobmanager_Status_JVM_Memory_NonHeap_Max gauge
flink_jobmanager_Status_JVM_Memory_NonHeap_Max{host="localhost",} -1.0
# HELP flink_jobmanager_job_lastCheckpointSize lastCheckpointSize (scope: jobmanager_job)
# TYPE flink_jobmanager_job_lastCheckpointSize gauge
flink_jobmanager_job_lastCheckpointSize{job_id="dfac65e575f318970e0225eab9688a2e",host="localhost",job_name="Popular_Places_to_Elasticsearch",} 271280.0
# HELP flink_jobmanager_job_fullRestarts fullRestarts (scope: jobmanager_job)
# TYPE flink_jobmanager_job_fullRestarts gauge
flink_jobmanager_job_fullRestarts{job_id="dfac65e575f318970e0225eab9688a2e",host="localhost",job_name="Popular_Places_to_Elasticsearch",} 0.0
# HELP flink_jobmanager_Status_JVM_Memory_Direct_MemoryUsed MemoryUsed (scope: jobmanager_Status_JVM_Memory_Direct)
# TYPE flink_jobmanager_Status_JVM_Memory_Direct_MemoryUsed gauge
flink_jobmanager_Status_JVM_Memory_Direct_MemoryUsed{host="localhost",} 2374600.0
# HELP flink_jobmanager_job_totalNumberOfCheckpoints totalNumberOfCheckpoints (scope: jobmanager_job)
# TYPE flink_jobmanager_job_totalNumberOfCheckpoints gauge
flink_jobmanager_job_totalNumberOfCheckpoints{job_id="dfac65e575f318970e0225eab9688a2e",host="localhost",job_name="Popular_Places_to_Elasticsearch",} 961.0
# HELP flink_jobmanager_Status_JVM_Memory_Mapped_TotalCapacity TotalCapacity (scope: jobmanager_Status_JVM_Memory_Mapped)
# TYPE flink_jobmanager_Status_JVM_Memory_Mapped_TotalCapacity gauge
flink_jobmanager_Status_JVM_Memory_Mapped_TotalCapacity{host="localhost",} 0.0

My question is how to export task or Back Pressure related metrics such as numRecordsIn,numRecordsInPerSecond,numRecordsOut to prometheus also? 我的问题是如何导出任务或背压相关的指标，如numRecordsIn，numRecordsInPerSecond，numRecordsOut也prometheus？ what else I need to configure? 我还需要配置什么？

BTW, My test environment is Flink 1.5.2 with job manager and task manager located in the same windows machine, also I found the Flink 1.6.0 has the same issue. 顺便说一句，我的测试环境是Flink 1.5.2，作业管理器和任务管理器位于同一台Windows机器上，我也发现Flink 1.6.0有同样的问题。

Answer 1

When you are running a jobmanager and a taskmanager on the same host, then each needs its own port. 当您在同一主机上运行jobmanager和taskmanager时，每个都需要自己的端口。 In flink-conf.yaml you can configure a range of ports like this, for example: 在flink-conf.yaml中，您可以配置一系列这样的端口，例如：

metrics.reporters: prom
metrics.reporter.prom.class: org.apache.flink.metrics.prometheus.PrometheusReporter
metrics.reporter.prom.port: 9250-9260

If you do this, then you'll find the jobmanager's metrics at http://localhost:9250 and the taskmanager's metrics at http://localhost:9251 . 如果这样做，那么你将在http：// localhost：9250找到jobmanager的指标，在http：// localhost：9251找到 taskmanager的指标。

You will also need to adjust your prometheus.yml to match: 您还需要调整prometheus.yml以匹配：

scrape_configs:
  - job_name: 'jobmanager'
    static_configs:
    - targets: ['localhost:9250']
  - job_name: 'taskmanager'
    static_configs:
    - targets: ['localhost:9251']

如何将flink任务或Back Pressure相关指标导出到prometheus？

问题描述

1 个解决方案

解决方案1
3 已采纳 2018-08-22 16:00:15

如何将flink任务或Back Pressure相关指标导出到prometheus？

问题描述

1 个解决方案

解决方案1 3 已采纳 2018-08-22 16:00:15

解决方案1
3 已采纳 2018-08-22 16:00:15