What is the AWS CloudWatch Agent disk_used_percent metric measuring? It does not match the usage I see with lsblk or df
I have a t4g.large EC2 instance, running Ubuntu 22.04, with a single 30GB storage volume. I have installed and configured the CloudWatch agent to monitor disk usage.
Right now, the metrics in CloudWatch show that the disk is 56% full.
If I run lsblk -f, I see this (I deleted the UUID column for conciseness):
NAME FSTYPE FSVER LABEL FSAVAIL FSUSE% MOUNTPOINTS
loop0 squashfs 4.0 0 100% /snap/core20/1699
loop1 squashfs 4.0 0 100% /snap/amazon-ssm-agent/5657
loop2 squashfs 4.0
loop3 squashfs 4.0 0 100% /snap/lxd/23545
loop4 squashfs 4.0 0 100% /snap/core18/2658
loop5 squashfs 4.0 0 100% /snap/core18/2636
loop6 squashfs 4.0 0 100% /snap/snapd/17885
loop7 squashfs 4.0 0 100% /snap/amazon-ssm-agent/6313
loop8 squashfs 4.0 0 100% /snap/core20/1740
nvme0n1
├─nvme0n1p1 ext4 1.0 cloudimg-rootfs 2.9G 90% /
└─nvme0n1p15 vfat FAT32 UEFI 92.4M 5% /boot/efi
If I run df -h, I see this:
Filesystem Size Used Avail Use% Mounted on
/dev/root 29G 27G 2.9G 91% /
tmpfs 3.9G 0 3.9G 0% /dev/shm
tmpfs 1.6G 1.1M 1.6G 1% /run
tmpfs 5.0M 0 5.0M 0% /run/lock
/dev/nvme0n1p15 98M 5.1M 93M 6% /boot/efi
tmpfs 782M 8.0K 782M 1% /run/user/1000
I don't understand where 56% could be coming from. Even if the CloudWatch agent were doing a sum over all of the mount points, it would come out to ~75%, not 56%.
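For what it's worth, an aggregated statistic like this is usually an average across every path the agent reports, which can include the read-only squashfs snap mounts that sit permanently at 100%. A small sketch of that arithmetic (the paths and values here are illustrative, not the actual set the agent was reporting):

```python
# Hypothetical per-path disk_used_percent samples, loosely based on the
# df/lsblk output above. The agent may also report read-only squashfs
# snap mounts, which are always 100% full.
samples = {
    "/": 91.0,
    "/boot/efi": 6.0,
    "/snap/core20/1699": 100.0,
    "/snap/core20/1740": 100.0,
    "/snap/snapd/17885": 100.0,
}

# The "Average" statistic over an aggregated metric averages across
# all dimension values (here: all paths) at each point in time.
average = sum(samples.values()) / len(samples)
print(f"root only: {samples['/']}%, aggregated average: {average:.1f}%")
```

Depending on which filesystems the agent actually enumerates, this average can land anywhere, which is why it tracks neither df nor lsblk for /.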
This is my config for the agent:
{
"agent": {
"metrics_collection_interval": 60,
"run_as_user": "root"
},
"metrics": {
"aggregation_dimensions": [
[
"InstanceId"
]
],
"append_dimensions": {
"AutoScalingGroupName": "${aws:AutoScalingGroupName}",
"ImageId": "${aws:ImageId}",
"InstanceId": "${aws:InstanceId}",
"InstanceType": "${aws:InstanceType}"
},
"metrics_collected": {
"collectd": {
"metrics_aggregation_interval": 60
},
"disk": {
"measurement": [
"used_percent"
],
"metrics_collection_interval": 60,
"resources": [
"*"
]
},
"mem": {
"measurement": [
"mem_used_percent"
],
"metrics_collection_interval": 60
},
"statsd": {
"metrics_aggregation_interval": 60,
"metrics_collection_interval": 30,
"service_address": ":8125"
}
}
}
}
I tried changing "*" to "/" or "/dev/root" in resources and restarted the agent, but it has made no difference in the reported value.
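A related knob worth knowing about: the disk section of the agent config accepts an ignore_file_system_types list, which keeps the wildcard from picking up pseudo and read-only filesystems in the first place. A sketch of the disk block with that option added (check that your agent version supports it before relying on it):

```json
"disk": {
    "measurement": [
        "used_percent"
    ],
    "metrics_collection_interval": 60,
    "resources": [
        "*"
    ],
    "ignore_file_system_types": [
        "squashfs",
        "tmpfs",
        "devtmpfs",
        "overlay"
    ]
}
```

With the snap squashfs mounts and tmpfs excluded, even an aggregated average would be dominated by the real volumes.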
Edit: I've now deleted a bunch of files; lsblk reports 33% disk usage at the "/" mount point, while CloudWatch says 52%.
I figured it out. The culprit is this part of the config:
"aggregation_dimensions": [
[
"InstanceId"
]
],
This means that the agent sends an "aggregate" value to CloudWatch, which is what I was looking at by accident. To reach this aggregate, I had navigated through the metrics in the CloudWatch console as "CWAgent" > "InstanceId" > "disk_used_percent". This metric holds a set of data points at each point in time: one result for each of the different paths the agent reports on. From there you can select the "Average", "Maximum", "Minimum", etc. statistic to use the data. I had selected "Average".
What I should have done was navigate through "CWAgent" > "ImageId, InstanceId, InstanceType, device, fstype, path" > "disk_used_percent" for path /. Then I would be looking only at the value for that path, there would be only one sample per time step, and it would match what I see in the terminal.
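If you never want the cross-path aggregate published at all, one option (a sketch; trimmed to the relevant parts, so verify against your full config) is to drop the aggregation_dimensions block entirely, leaving only the fully-dimensioned per-path metric:

```json
"metrics": {
    "append_dimensions": {
        "InstanceId": "${aws:InstanceId}"
    },
    "metrics_collected": {
        "disk": {
            "measurement": [
                "used_percent"
            ],
            "metrics_collection_interval": 60,
            "resources": [
                "/"
            ]
        }
    }
}
```

Without aggregation_dimensions, there is no InstanceId-only rollup to stumble into in the console.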
Note: If you really want to dive deep, you can check out the collectd config at /etc/collectd/collectd.conf, which has a config for "". This should point you to the path where collectd stores the df information that the CloudWatch agent is reading.