
Is it possible for Prometheus to capture metrics of each process in a large batch job?

As per title, is it possible for Prometheus to capture metrics of each individual process in a large batch job?

This job runs hourly and processes records at a rate of about 500-1000 records/second, sending gauge metrics (unique to each record) to statsd_exporter for monitoring our SLOs.

However, I realized Prometheus can only capture as much as the scrape_interval allows, meaning it's definitely missing some values sent to statsd_exporter (e.g. possibly missing some spikes in value).

Is there a way to overcome this? Or perhaps I should be looking at some other tools instead.

Update: provided an example of the metrics sent. The job label is limited to 10 different values, i.e. 10 time series.

# HELP time_taken_gauge Time taken for a particular job type to finish processing a record.
# TYPE time_taken_gauge gauge
time_taken_gauge{job="a"} 123
time_taken_gauge{job="b"} 1314
time_taken_gauge{job="c"} 5435
time_taken_gauge{job="d"} 212
time_taken_gauge{job="e"} 231
time_taken_gauge{job="f"} 324
time_taken_gauge{job="g"} 15
time_taken_gauge{job="h"} 1213
time_taken_gauge{job="i"} 123
time_taken_gauge{job="j"} 1235

The only challenge is that these are sent in at a much higher rate than Prometheus' scrape interval (1s), hence some records are missed:

time_taken_gauge{job="a"} 123
time_taken_gauge{job="a"} 1232 <- scraped
time_taken_gauge{job="a"} 12412
time_taken_gauge{job="a"} 53453 <- high value metric missed but potentially problematic
time_taken_gauge{job="a"} 1564
time_taken_gauge{job="a"} 756
time_taken_gauge{job="a"} 34 <- scraped
time_taken_gauge{job="a"} 15433
.
.
.
time_taken_gauge{job="a"} 345 <- scraped

500-1000 records/second, sending gauge metrics (unique to each record)

That's going to be high cardinality. If you want a unique metric per record, then you need an event logging system like ELK, not a metrics system like Prometheus or whatever you have statsd feeding into.
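
A minimal sketch of the event-logging alternative suggested above: emit one structured log line per record, which a pipeline such as ELK can index, instead of overwriting a gauge for every record. The field names and the log_record_timing helper are invented for illustration; only Python's standard library is used.

import json
import logging
import time

logging.basicConfig(level=logging.INFO, format='%(message)s')

def log_record_timing(job_type: str, record_id: str, time_taken_ms: float) -> None:
    # One JSON line per record; every value is retained, not just the
    # last value seen at scrape time.
    logging.info(json.dumps({
        'ts': time.time(),
        'job_type': job_type,
        'record_id': record_id,
        'time_taken_ms': time_taken_ms,
    }))

log_record_timing('a', 'rec-000123', 123.0)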

The way to overcome short-lived or slow jobs is to use a Pushgateway, which will store all metrics pushed to it and expose them to be scraped.

This is the standard way of handling such cases; I'm not familiar with alternatives.

The Prometheus Pushgateway exists to allow ephemeral and batch jobs to expose their metrics to Prometheus. Since these kinds of jobs may not exist long enough to be scraped, they can instead push their metrics to a Pushgateway. The Pushgateway then exposes these metrics to Prometheus.

https://github.com/prometheus/pushgateway
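
For illustration, a minimal sketch of a batch job pushing its final metrics to a Pushgateway using the prometheus_client Python library. The gateway address (localhost:9091) and the job name are assumptions; the label is renamed to job_type here so it does not clash with the job grouping key the Pushgateway itself uses.

from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

registry = CollectorRegistry()
g = Gauge('time_taken_gauge',
          'Time taken for a particular job type to finish processing a record.',
          ['job_type'], registry=registry)

# ... the batch processing loop would set the gauge here ...
g.labels(job_type='a').set(123)

# Push the values; the Pushgateway keeps them available for Prometheus to
# scrape even after this process exits.
push_to_gateway('localhost:9091', job='hourly_batch', registry=registry)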
