What is the AWS CloudWatch Agent disk_used_percent metric measuring? It does not match the usage I see with lsblk or df
I have a t4g.large EC2 instance, running Ubuntu 22.04, with a single 30GB storage volume. I have installed and configured the CloudWatch agent to monitor disk usage.
Right now, the metrics in CloudWatch show that the disk is 56% full.
If I run lsblk -f, I see this (I deleted the UUID column for conciseness):
NAME FSTYPE FSVER LABEL FSAVAIL FSUSE% MOUNTPOINTS
loop0 squashfs 4.0 0 100% /snap/core20/1699
loop1 squashfs 4.0 0 100% /snap/amazon-ssm-agent/5657
loop2 squashfs 4.0
loop3 squashfs 4.0 0 100% /snap/lxd/23545
loop4 squashfs 4.0 0 100% /snap/core18/2658
loop5 squashfs 4.0 0 100% /snap/core18/2636
loop6 squashfs 4.0 0 100% /snap/snapd/17885
loop7 squashfs 4.0 0 100% /snap/amazon-ssm-agent/6313
loop8 squashfs 4.0 0 100% /snap/core20/1740
nvme0n1
├─nvme0n1p1 ext4 1.0 cloudimg-rootfs 2.9G 90% /
└─nvme0n1p15 vfat FAT32 UEFI 92.4M 5% /boot/efi
If I run df -h, I see this:
Filesystem Size Used Avail Use% Mounted on
/dev/root 29G 27G 2.9G 91% /
tmpfs 3.9G 0 3.9G 0% /dev/shm
tmpfs 1.6G 1.1M 1.6G 1% /run
tmpfs 5.0M 0 5.0M 0% /run/lock
/dev/nvme0n1p15 98M 5.1M 93M 6% /boot/efi
tmpfs 782M 8.0K 782M 1% /run/user/1000
I don't understand where 56% could be coming from. Even if the CloudWatch agent were doing a sum over all of the mount points, it would come out to ~75%, not 56%.
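For what it's worth, an aggregated statistic like this is usually an average across every path the agent reports, which can include the read-only squashfs snap mounts that sit permanently at 100%. A small sketch of that arithmetic (the paths and values here are illustrative, not the actual set the agent was reporting):

```python
# Hypothetical per-path disk_used_percent samples, loosely based on the
# df/lsblk output above. The agent may also report read-only squashfs
# snap mounts, which are always 100% full.
samples = {
    "/": 91.0,
    "/boot/efi": 6.0,
    "/snap/core20/1699": 100.0,
    "/snap/core20/1740": 100.0,
    "/snap/snapd/17885": 100.0,
}

# The "Average" statistic over an aggregated metric averages across
# all dimension values (here: all paths) at each point in time.
average = sum(samples.values()) / len(samples)
print(f"root only: {samples['/']}%, aggregated average: {average:.1f}%")
```

Depending on which filesystems the agent actually enumerates, this average can land anywhere, which is why it tracks neither df nor lsblk for /.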
This is my config for the agent:
{
"agent": {
"metrics_collection_interval": 60,
"run_as_user": "root"
},
"metrics": {
"aggregation_dimensions": [
[
"InstanceId"
]
],
"append_dimensions": {
"AutoScalingGroupName": "${aws:AutoScalingGroupName}",
"ImageId": "${aws:ImageId}",
"InstanceId": "${aws:InstanceId}",
"InstanceType": "${aws:InstanceType}"
},
"metrics_collected": {
"collectd": {
"metrics_aggregation_interval": 60
},
"disk": {
"measurement": [
"used_percent"
],
"metrics_collection_interval": 60,
"resources": [
"*"
]
},
"mem": {
"measurement": [
"mem_used_percent"
],
"metrics_collection_interval": 60
},
"statsd": {
"metrics_aggregation_interval": 60,
"metrics_collection_interval": 30,
"service_address": ":8125"
}
}
}
}
I tried changing "*" to "/" or "/dev/root" in resources and restarted the agent, but it has made no difference in the reported value.
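A related knob worth knowing about: the disk section of the agent config accepts an ignore_file_system_types list, which keeps the wildcard from picking up pseudo and read-only filesystems in the first place. A sketch of the disk block with that option added (check that your agent version supports it before relying on it):

```json
"disk": {
    "measurement": [
        "used_percent"
    ],
    "metrics_collection_interval": 60,
    "resources": [
        "*"
    ],
    "ignore_file_system_types": [
        "squashfs",
        "tmpfs",
        "devtmpfs",
        "overlay"
    ]
}
```

With the snap squashfs mounts and tmpfs excluded, even an aggregated average would be dominated by the real volumes.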
Edit: I've now deleted a bunch of files; lsblk reports 33% disk usage at the "/" mount point, while CloudWatch says 52%.
I figured it out. The culprit is this part of the config:
"aggregation_dimensions": [
[
"InstanceId"
]
],
This means that the agent sends an "aggregate" value to CloudWatch, which is what I was looking at by accident. To reach this aggregate, I had navigated through the metrics in the CloudWatch console as "CWAgent" > "InstanceId" > "disk_used_percent". This metric holds a set of data points at each point in time: one result for each of the different paths the agent reports on. From there you can select the "Average", "Maximum", "Minimum", etc. statistic to use the data. I had selected "Average".
What I should have done was navigate through "CWAgent" > "ImageId, InstanceId, InstanceType, device, fstype, path" > "disk_used_percent" for path /. Then I would be looking only at the value for that path, there would be only one sample per time step, and it would match what I see in the terminal.
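If you never want the cross-path aggregate published at all, one option (a sketch; trimmed to the relevant parts, so verify against your full config) is to drop the aggregation_dimensions block entirely, leaving only the fully-dimensioned per-path metric:

```json
"metrics": {
    "append_dimensions": {
        "InstanceId": "${aws:InstanceId}"
    },
    "metrics_collected": {
        "disk": {
            "measurement": [
                "used_percent"
            ],
            "metrics_collection_interval": 60,
            "resources": [
                "/"
            ]
        }
    }
}
```

Without aggregation_dimensions, there is no InstanceId-only rollup to stumble into in the console.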
Note: If you really want to dive deep, you can check out the collectd config at /etc/collectd/collectd.conf, which has a config for "". This should point you to the path where collectd stores the df information that the CloudWatch agent is reading.