How to get metrics of bunches of short-lived Kubernetes jobs

Original post: 2019-07-19 09:43:27 · Tags: kubernetes, kubelet, cadvisor

I have a case where short-lived k8s jobs (running from a few seconds to 1-2 minutes) are created on user request. I'm trying to retrieve job runtime metrics, such as CPU and memory usage.

The methods I've thought of (and tried) include:

Prometheus query, like container_cpu_usage_seconds_total, but pull-based scraping means that many short-lived jobs will not be included.
Pushgateway, but as the Prometheus documentation suggests, "...valid use case for the Pushgateway is for capturing the outcome of a service-level batch job", so I doubt this is a suitable case.
Metrics-server, but metrics-server only returns 404 for short-lived job pods, leading to worse results than Prometheus.
Query /api/v1/nodes/{nodeName}/proxy/metrics/cadvisor directly. Though almost real-time, it returns all containers, so I have to manually parse the results and find what I need.
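
For the last option, the manual parsing step might look roughly like the sketch below. It assumes the endpoint's response has already been fetched as text (for example with `kubectl get --raw /api/v1/nodes/{nodeName}/proxy/metrics/cadvisor`) and that it is in the Prometheus text exposition format. The helper names `parse_samples` and `pod_usage`, the sample data, and the pod name `job-abc` are made up for illustration. The label names are also an assumption: depending on the Kubernetes version, cAdvisor exposes either `pod`/`container` or `pod_name`/`container_name`, so the sketch checks both.

```python
import re

# One sample line: metric name, optional {labels}, value, optional timestamp.
SAMPLE_RE = re.compile(
    r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)(?:\s+\d+)?$'
)
# One label pair inside the braces: name="value" (value may contain escapes).
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_samples(text):
    """Yield (metric_name, labels_dict, value) for every sample line."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        yield name, labels, float(value)


def pod_usage(text, pod_name,
              wanted=("container_cpu_usage_seconds_total",
                      "container_memory_working_set_bytes")):
    """Return {(container, metric): value} for one pod's real containers."""
    usage = {}
    for name, labels, value in parse_samples(text):
        if name not in wanted:
            continue
        # Assumption: newer clusters use pod/container, older ones *_name.
        pod = labels.get("pod") or labels.get("pod_name")
        container = labels.get("container") or labels.get("container_name")
        # Skip other pods, pod-level cgroup rows (empty container) and the
        # pause container, which is reported as "POD".
        if pod != pod_name or not container or container == "POD":
            continue
        key = (container, name)
        # Summing also covers versions that report one CPU series per core.
        usage[key] = usage.get(key, 0.0) + value
    return usage


# Hypothetical response excerpt, shortened to the labels used above.
SAMPLE = '''\
# TYPE container_cpu_usage_seconds_total counter
container_cpu_usage_seconds_total{container="worker",cpu="total",namespace="default",pod="job-abc"} 1.5 1563529407000
container_cpu_usage_seconds_total{container="worker",cpu="total",namespace="default",pod="other-job"} 9.0 1563529407000
# TYPE container_memory_working_set_bytes gauge
container_memory_working_set_bytes{container="worker",namespace="default",pod="job-abc"} 2097152 1563529407000
container_memory_working_set_bytes{container="",namespace="default",pod="job-abc"} 4194304 1563529407000
'''

if __name__ == "__main__":
    for (container, metric), value in pod_usage(SAMPLE, "job-abc").items():
        print(container, metric, value)
```

Since the CPU metric is a cumulative counter, a single reading taken just before the job finishes already approximates the job's total CPU time; the hard part remains catching the pod while it still exists.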

I'm thinking of running a lightweight monitor container beside the job worker container to retrieve the worker's metrics. But I don't know whether this is a good idea, and even if it is, how to retrieve the worker's metrics.
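
A different technique from the monitor container, shown here only as a sketch and not as something I have tried in the cluster, is to let the worker's entrypoint measure itself: wrap the real job command and read the operating system's accounting for the child process once it exits. The function name `run_and_measure` and the demo command are hypothetical. It relies on Python's standard `resource` module, so it works only on Unix-like systems, and `ru_maxrss` is reported in kilobytes on Linux.

```python
import json
import resource
import subprocess
import sys


def run_and_measure(cmd):
    """Run cmd to completion and return its exit code and resource usage."""
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    completed = subprocess.run(cmd)
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    return {
        "exit_code": completed.returncode,
        # CPU time is cumulative over all waited-for children, so take the
        # difference around this one run.
        "cpu_user_seconds": after.ru_utime - before.ru_utime,
        "cpu_system_seconds": after.ru_stime - before.ru_stime,
        # Peak resident set size of the largest child so far (kB on Linux).
        "max_rss_kb": after.ru_maxrss,
    }


if __name__ == "__main__":
    # Hypothetical job command; a real wrapper would take it from sys.argv.
    stats = run_and_measure([sys.executable, "-c", "sum(range(10**6))"])
    # Emitting one JSON line lets any log collector pick the numbers up.
    print(json.dumps(stats))
```

This gives one summary per job rather than a time series, which may be enough for jobs that live only seconds.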

So my question is:

What method do you recommend for retrieving the CPU and memory usage of a large number of short-lived jobs?

1 answer

As you wrote, you have tried Prometheus, Pushgateway, metrics-server, and querying /api/v1/nodes/{nodeName}/proxy/metrics/cadvisor. If none of them satisfies you, a newer approach that I recommend for monitoring and saving cluster performance metrics is Litmus.

Prometheus is the most common tool, and a complex one, which most engineers are likely to use. Litmus is a fairly new tool focused on workload testing; metrics are saved and you can store them for as long as you want.

You can find more information here: litmus.

Useful article: litmus-openebs, which describes how to get metrics beyond just memory usage.

Then you can generate charts in, e.g., gnuplot.