Bash/AWS CLI: Trying to figure out how to validate a range of uptimes with 2 checks when rebooting groups of EC2 instances through AWS SSM

Question

I have been struggling to figure out the best way to approach this problem in a bash script. I have a command that checks groups of servers for their uptime in minutes. I only want to continue on to the next group of reboots once all of the servers have been up for 5 minutes, but I also want to verify that they haven't been up for over an hour, in case the reboot didn't take.

I was originally trying to set up a while loop that would keep issuing the command to check uptimes and send the output into an array. I am trying to figure out how to loop through an array until all elements of that array are greater than 5 and less than the upper limit. I haven't even been successful with the first check of greater than 5. Is it even possible to continually write to an array and perform arithmetic checks against every value in the array, so that all values must be greater than X, in a while loop? The number of servers putting their current uptime into the array varies per group, so the array won't always hold the same number of values.

Is an array even the proper way to do this? I'd provide examples of what I have tried so far, but it's a huge mess, and I think starting from scratch and just asking for input might be best.

Output of the command I am running to pull uptimes looks similar to the following:

Edit

Due to the help provided, I was able to get a functional proof of concept together for this, and I'm stoked. Here it is in case it helps anyone trying to do something similar in the future. The problem at hand was that we use AWS SSM for all of our Windows server patching, and many times when SSM tells servers to reboot after patching, the SSM Agent takes ages to check in. This slows down our entire process, which right now is fairly manual across dozens of patch groups. Many times we have to go and manually verify that a server did indeed reboot after we told it to from SSM, so that we know we can start the reboots for the next patch group. With this we will be able to run a single script that issues reboots for our patch groups in the proper order and verifies that the servers have properly rebooted before continuing on to the next group.

#!/bin/bash

### The purpose of this script is to automate the execution of commands required to reboot groups of AWS Windows servers utilizing SSM while also verifying their uptime and only continuing on to the next group once the previous has reached X # of minutes. This solves the problems of AWS SSM Agents not properly checking in with SSM post-reboot.

patchGroups=(01 02 03)                      # array containing the values of the RebootGroup tag


for group in "${patchGroups[@]}"
do
    printf "Rebooting Patch Group %q\n" "$group"
    # $(...) is left unquoted on purpose so each instance ID becomes its own argument
    aws ec2 reboot-instances --instance-ids $(aws ec2 describe-instances --filters "Name=tag:RebootGroup,Values=$group" --query 'Reservations[].Instances[].InstanceId' --output text)

    sleep 2m

    unset      passed failed                      # wipe arrays
    declare -A passed failed                      # declare associative arrays

    # serverList is a plain whitespace-separated string of instance IDs, not an array
    serverList=$(aws ec2 describe-instances --filters "Name=tag:RebootGroup,Values=$group" --query 'Reservations[*].Instances[*].[InstanceId]' --output text)

    for server in ${serverList}                  # loop through list of servers
    do
        failed["${server}"]=0                     # add to the failed[] array
    done

    while [[ "${#failed[@]}" -gt 0 ]]             # loop while number of servers in the failed[] array is greater than 0
    do
        for server in "${!failed[@]}"             # loop through servers in the failed[] array
        do
            ssmID=$(aws ssm send-command --document-name "AWS-RunPowerShellScript" --document-version "1" --targets "[{\"Key\":\"InstanceIds\",\"Values\":[\"$server\"]}]" --parameters '{"commands":["$wmi = Get-WmiObject -Class Win32_OperatingSystem ","$uptimeMinutes =    ($wmi.ConvertToDateTime($wmi.LocalDateTime)-$wmi.ConvertToDateTime($wmi.LastBootUpTime) | select-object -expandproperty \"TotalMinutes\")","[int]$uptimeMinutes"],"workingDirectory":[""],"executionTimeout":["3600"]}' --timeout-seconds 600 --max-concurrency    "50" --max-errors "0" --region us-west-2 --output text --query "Command.CommandId")

            sleep 5

            uptime=$(aws ssm list-command-invocations --command-id "$ssmID" --details --query 'CommandInvocations[].CommandPlugins[].Output' --output text | sed 's/\r$//')

            printf "Checking instance ID %q\n" "$server"
            printf "Value of uptime is = %q\n" "$uptime"

            # if uptime is within our 'success' window then move server to passed[] array

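            if [[ "${uptime}" =~ ^[0-9]+$ && "${uptime}" -ge 3 && "${uptime}" -lt 60 ]]
            then
                passed["${server}"]="${uptime}"   # add to passed[] array
                printf "Server with instance ID %q has successfully rebooted.\n" "$server"
                unset "failed[${server}]"         # remove from failed[] array (quoted so the brackets are not treated as a glob)
            else
                failed["${server}"]="${uptime}"   # record the latest uptime for the status report below
            fi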
        done

        # display current status (edit/remove as desired)

        printf "\n++++++++++++++ successful reboots\n"
        printf "%s\n" "${!passed[@]}" | sort -n

        printf "\n++++++++++++++ failed reboot\n"

        for server in "${!failed[@]}"
        do
            printf "%s - %s (mins)\n" "${server}" "${failed[${server}]}"
        done | sort -n

        printf "\n"

        sleep 60                            # adjust as necessary
    done
done
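
One fragile spot in the script above is the fixed sleep 5 between send-command and list-command-invocations, which assumes the command has finished within five seconds. A possible alternative, sketched here under assumptions, polls aws ssm get-command-invocation until the invocation reaches a terminal status and only then reads StandardOutputContent. The function name get_uptime_minutes, the default retry count, and the default poll interval are hypothetical and not part of the original script.

```shell
#!/bin/bash

# Hypothetical helper (not part of the original script): wait for one SSM
# command invocation to finish, then print its standard output.
# Usage: get_uptime_minutes COMMAND_ID INSTANCE_ID [MAX_TRIES] [INTERVAL_SECS]
get_uptime_minutes() {
    local command_id="$1" instance_id="$2"
    local max_tries="${3:-30}" interval="${4:-2}"   # assumed defaults
    local status try

    for (( try = 0; try < max_tries; try++ )); do
        # Errors (e.g. the invocation is not registered yet) leave status empty,
        # which simply causes another poll.
        status=$(aws ssm get-command-invocation \
                    --command-id "$command_id" \
                    --instance-id "$instance_id" \
                    --query 'Status' --output text 2>/dev/null)

        case "$status" in
            Success)
                # Strip the Windows carriage return and any other whitespace.
                aws ssm get-command-invocation \
                    --command-id "$command_id" \
                    --instance-id "$instance_id" \
                    --query 'StandardOutputContent' --output text |
                    tr -d '[:space:]'
                printf '\n'
                return 0
                ;;
            Failed|Cancelled|TimedOut)
                return 1
                ;;
        esac
        sleep "$interval"
    done
    return 1    # still pending after max_tries polls
}
```

Inside the loop this could replace the sleep 5 and list-command-invocations pair with something like uptime=$(get_uptime_minutes "$ssmID" "$server"), leaving the rest of the logic unchanged.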

Answer 1

It sounds like you need to keep re-evaluating the output of uptime to get the data you need, so an array or other variable may just get you stuck. Think about this functionally (as in functions). You need a function that checks, just once, whether the uptime is within the bounds you want. Then you need to run that function periodically. If it succeeds, you trigger the reboot. If it fails, you let it try again later.

Consider this code:

uptime_in_bounds() {
    local min="$1"
    local max="$2"
    local uptime_float uptime_secs

    # The first value in /proc/uptime is the number of seconds the
    # system has been up. We have to truncate it to an integer…
    read -r uptime_float _ < /proc/uptime
    uptime_secs="${uptime_float%.*}"

    # A shell function reflects the exit status of its last command.
    # This function "succeeds" if the uptime_secs is between min and max.
    (( min < uptime_secs && max > uptime_secs ))
}

if uptime_in_bounds 300 3600; then
    sudo reboot  # or whatever
fi
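
The function above reads the local machine's /proc/uptime, whereas the question collects uptimes in minutes from a varying number of remote servers. As a rough sketch of the same idea applied to a list of values, the check can take the uptime as an argument and be run over every element. The names value_in_bounds and all_in_bounds are hypothetical, the sample values are made up, and the bounds are exclusive as in the function above.

```shell
#!/bin/bash

# Succeeds if VALUE is a non-negative integer strictly between MIN and MAX.
# Usage: value_in_bounds VALUE MIN MAX
value_in_bounds() {
    local value="$1" min="$2" max="$3"
    [[ "$value" =~ ^[0-9]+$ ]] || return 1      # reject empty or non-numeric input
    # 10# forces base 10 so a value such as 08 is not parsed as octal
    (( min < 10#$value && 10#$value < max ))
}

# Succeeds only if every value after MIN and MAX is within the bounds.
# Usage: all_in_bounds MIN MAX VALUE...
all_in_bounds() {
    local min="$1" max="$2" value
    shift 2
    (( $# > 0 )) || return 1                    # an empty list counts as "not ready"
    for value in "$@"; do
        value_in_bounds "$value" "$min" "$max" || return 1
    done
    return 0
}

uptimes=(12 7 45)                               # made-up example data, in minutes
if all_in_bounds 5 60 "${uptimes[@]}"; then
    echo "all servers within bounds"
else
    echo "at least one server out of bounds"
fi
```

Because the values are passed as arguments, the number of servers per group does not matter; the array can simply be rebuilt on every polling pass.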

Answer 2

General idea... will likely need some tweaking based on how OP is tracking servers, obtaining uptimes, etc...

# for a given set of servers, and assuming stored in variable ${server_list} ...

unset      passed failed                      # wipe arrays
declare -A passed failed                      # declare associative arrays

for server in ${server_list}                  # loop through list of servers
do
    failed["${server}"]=0                     # add to the failed[] array
done

while [[ "${#failed[@]}" -gt 0 ]]             # loop while number of servers in the failed[] array is greater than 0
do
    for server in "${!failed[@]}"             # loop through servers in the failed[] array
    do
        uptime=$( some_command_to_get_uptime_for_server "${server}" )

        # if uptime is within our 'success' window then move server to passed[] array

        if [[ "${uptime}" -gt 5 && "${uptime}" -lt 60 ]] 
        then
            passed["${server}"]="${uptime}"   # add to passed[] array
            unset "failed[${server}]"         # remove from failed[] array (quoted so the brackets are not treated as a glob)
        else
            failed["${server}"]="${uptime}"
        fi
    done

    # display current status (edit/remove as desired)

    printf "\n++++++++++++++ successful reboots\n"
    printf "%s\n" "${!passed[@]}" | sort -n

    printf "\n++++++++++++++ failed reboot\n"

    for server in "${!failed[@]}"
    do
        printf "%s - %s (mins)\n" "${server}" "${failed[${server}]}"
    done | sort -n

    printf "\n"

    sleep 30                            # adjust as necessary
done

NOTES:

this code would likely be part of a larger looping construct based on sets of servers (ie, a new ${server_list} for each set)
if the list of servers is in another format (eg, file, another array, etc), the for loop will need to be modified to properly populate the failed[] array
OP will need to edit to add code for finding the uptime for a given ${server}
OP is (obviously) free to rename variables/arrays as desired
OP will probably need to decide what to do if the while loop continues 'too long'
if a new ${uptime} is not within the 5-60 min range, OP can extend the else block to perform some other operation(s) for the problematic ${server}
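
For the 'too long' case mentioned in the notes, one option is to give the loop a deadline based on bash's built-in SECONDS counter. The following is only a sketch under assumptions: the wrapper name wait_for_group and its argument order are hypothetical, and the uptime lookup is passed in as a command name because how uptimes are obtained is left to OP.

```shell
#!/bin/bash

# Hypothetical wrapper: poll until every server passes, or give up at a deadline.
# Usage: wait_for_group TIMEOUT_SECS INTERVAL_SECS UPTIME_CMD SERVER...
# UPTIME_CMD is any command or function that prints a server's uptime in minutes.
wait_for_group() {
    local timeout="$1" interval="$2" uptime_cmd="$3"
    shift 3
    local -A pending=()                         # servers that have not passed yet
    local server uptime
    local deadline=$(( SECONDS + timeout ))     # SECONDS counts seconds since shell start

    for server in "$@"; do
        pending["$server"]=0
    done

    while (( ${#pending[@]} > 0 )); do
        for server in "${!pending[@]}"; do
            uptime=$("$uptime_cmd" "$server")
            # same 5-60 minute window as the loop above, plus a numeric guard
            if [[ "$uptime" =~ ^[0-9]+$ ]] && (( 10#$uptime > 5 && 10#$uptime < 60 )); then
                unset "pending[$server]"
            else
                pending["$server"]="$uptime"
            fi
        done

        (( ${#pending[@]} == 0 )) && break      # everything passed, no need to sleep
        if (( SECONDS >= deadline )); then
            printf 'timed out waiting for: %s\n' "${!pending[*]}" >&2
            return 1
        fi
        sleep "$interval"
    done
    return 0
}
```

A caller could then decide per group what a failure means, for example wait_for_group 1800 30 some_command_to_get_uptime_for_server ${server_list} || exit 1 to stop before the next group is rebooted.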

2 answers

Answer 1: score 2, 2020-12-18 19:51:13
Answer 2: score 1, accepted, 2020-12-18 20:33:00
