Bash/AWS CLI: Trying to figure out how to validate a range of uptimes with 2 checks when rebooting groups of EC2 instances via AWS SSM

Question

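I've been struggling to work out the best approach to this problem in a bash script. I have a command that checks the uptime, in minutes, of a group of servers. I only want to move on to the next group of reboots once every server has been up for 5 minutes, but I also want to verify that none of them has been up for more than an hour, in case the reboot did not take effect.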

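I originally tried to set up a while loop that keeps issuing the command to check uptime and sends the output into an array. I'm trying to figure out how to loop over the array until every element is greater than 5 and less than an upper limit. I haven't even succeeded with the first check, greater than 5. Is it even possible to keep writing to an array and run an arithmetic check against each value in it, so that all values must be greater than X for the while loop to end? The number of servers that report their current uptime into the array varies from group to group, so the array will not always hold the same number of values.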
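A check of this kind can be written as a small function that walks the array and fails on the first value outside the window. What follows is only a minimal sketch, assuming the uptimes are whole minutes already collected into a bash array; the function name `all_in_range` and the sample `uptimes` values are made up for illustration.

```bash
#!/usr/bin/env bash

# all_in_range MIN MAX VALUE...   (hypothetical helper)
# Succeeds only if at least one VALUE is given and every VALUE is an
# integer with MIN < VALUE < MAX.
all_in_range() {
    local min="$1" max="$2"
    shift 2
    (( $# > 0 )) || return 1                     # an empty list never passes
    local value
    for value in "$@"; do
        [[ "$value" =~ ^[0-9]+$ ]] || return 1   # reject blanks and error text
        (( 10#$value > min && 10#$value < max )) || return 1
    done
    return 0
}

# Hypothetical sample data: uptimes in minutes for one group of servers.
uptimes=(7 12 9)

if all_in_range 5 60 "${uptimes[@]}"; then
    echo "all servers are inside the window"
else
    echo "at least one server is outside the window"
fi
```

Because the function takes the values as arguments, it does not matter how many servers a group contains.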

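Is an array even the right approach? I would post examples of what I've tried so far, but it's a huge mess, and I think starting from scratch and just asking for input is probably the best place to begin.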

The output of the command I'm running looks similar to the following:

Edit

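Thanks to the help provided, I was able to get a working proof of concept for this, and I'm thrilled. Here it is, in case it helps anyone trying to do something similar in the future. The problem at hand is that we use AWS SSM for all of our Windows server patching, and many times, when SSM tells a server to reboot after patching, the SSM Agent takes a long time to check back in. This slows down our whole process, which right now is fairly manual across dozens of patch groups. Often we have to go and manually verify that a server really did reboot after we told it to from SSM, so that we know we can start the reboots for the next patch group. With this, we will be able to run one script that reboots our patch groups in the proper order and verifies that the servers have rebooted properly before moving on to the next group.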

#!/bin/bash

### The purpose of this script is to automate the execution of commands required to reboot groups of AWS Windows servers utilizing SSM while also verifying their uptime and only continuing on to the next group once the previous has reached X # of minutes. This solves the problems of AWS SSM Agents not properly checking in with SSM post-reboot.

patchGroups=(01 02 03)                      # array containing the values of the RebootGroup tag


for group in "${patchGroups[@]}"
do
    printf "Rebooting Patch Group %q\n" "$group"
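    # the inner command substitution is left unquoted on purpose so each instance ID becomes its own argument
    aws ec2 reboot-instances --instance-ids $(aws ec2 describe-instances --filters "Name=tag:RebootGroup,Values=$group" --query 'Reservations[].Instances[].InstanceId' --output text)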

    sleep 2m

    unset      passed failed serverList                      # wipe arrays and the server list
    declare -A passed failed                                 # declare associative arrays

    # serverList is a plain whitespace-separated string of instance IDs, not an array
    serverList=$(aws ec2 describe-instances --filters "Name=tag:RebootGroup,Values=$group" --query 'Reservations[*].Instances[*].[InstanceId]' --output text)

    for server in ${serverList}                  # loop through list of servers
    do
        failed["${server}"]=0                     # add to the failed[] array
    done

    while [[ "${#failed[@]}" -gt 0 ]]             # loop while number of servers in the failed[] array is greater than 0
    do
        for server in "${!failed[@]}"             # loop through servers in the failed[] array
        do
            ssmID=$(aws ssm send-command --document-name "AWS-RunPowerShellScript" --document-version "1" --targets "[{\"Key\":\"InstanceIds\",\"Values\":[\"$server\"]}]" --parameters '{"commands":["$wmi = Get-WmiObject -Class Win32_OperatingSystem ","$uptimeMinutes =    ($wmi.ConvertToDateTime($wmi.LocalDateTime)-$wmi.ConvertToDateTime($wmi.LastBootUpTime) | select-object -expandproperty \"TotalMinutes\")","[int]$uptimeMinutes"],"workingDirectory":[""],"executionTimeout":["3600"]}' --timeout-seconds 600 --max-concurrency    "50" --max-errors "0" --region us-west-2 --output text --query "Command.CommandId")

            sleep 5

            uptime=$(aws ssm list-command-invocations --command-id "$ssmID" --details --query 'CommandInvocations[].CommandPlugins[].Output' --output text | sed 's/\r$//')

            printf "Checking instance ID %q\n" "$server"
            printf "Value of uptime is = %q\n" "$uptime"

            # if uptime is within our 'success' window then move server to passed[] array

            if [[ "${uptime}" -ge 3 && "${uptime}" -lt 60 ]] 
            then
                passed["${server}"]="${uptime}"   # add to passed[] array
                printf "Server with instance ID %q has successfully rebooted.\n" "$server"
                unset "failed[${server}]"         # remove from failed[] array (quoted so the shell cannot glob the subscript)
            fi
        done

        # display current status (edit/remove as desired)

        printf "\n++++++++++++++ successful reboots\n"
        printf "%s\n" "${!passed[@]}" | sort -n

        printf "\n++++++++++++++ failed reboot\n"

        for server in "${!failed[@]}"
        do
            printf "%s - %s (mins)\n" "${server}" "${failed[${server}]}"
        done | sort -n

        printf "\n"

        sleep 60                            # adjust as necessary
    done
done
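The test `[[ "${uptime}" -ge 3 && "${uptime}" -lt 60 ]]` in the script above assumes SSM handed back a single integer. If the command has not finished after the `sleep 5`, the output may be empty, and if it failed the output may be error text, which can make the arithmetic comparison complain. One way to guard against that is to validate the value before comparing it. This is only a sketch under those assumptions; `uptime_in_window` is a hypothetical helper name and the sample strings merely stand in for possible SSM output.

```bash
#!/usr/bin/env bash

# uptime_in_window VALUE MIN MAX   (hypothetical helper)
# Succeeds only if VALUE is a plain non-negative integer and
# MIN <= VALUE < MAX, matching the -ge / -lt test in the script.
uptime_in_window() {
    local value="$1" min="$2" max="$3"
    [[ "$value" =~ ^[0-9]+$ ]] || return 1   # empty, multi-line or error text
    (( 10#$value >= min && 10#$value < max ))
}

# Hypothetical sample outputs standing in for what SSM might return.
for sample in "12" "" "2" "75" "Access denied"; do
    if uptime_in_window "$sample" 3 60; then
        printf '[%s] passed\n' "$sample"
    else
        printf '[%s] still waiting\n' "$sample"
    fi
done
```

In the script, the `if` line could then become `if uptime_in_window "${uptime}" 3 60`, leaving the rest of the loop unchanged.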

Answer 1

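It sounds like you need to keep re-evaluating the output of uptime to get the data you need, so an array or other variable may just get in your way. Think about this functionally (as in functions). You need a function that checks whether the uptime is within the bounds you want, just once. Then you need to run that function periodically. If it succeeds, you trigger the reboot. If it fails, you let it try again later.

Consider this code: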

uptime_in_bounds() {
    local min="$1"
    local max="$2"
    local uptime_float uptime_secs

    # The first value in /proc/uptime is the number of seconds the
    # system has been up. We have to truncate it to an integer…
    read -r uptime_float _ < /proc/uptime
    uptime_secs="${uptime_float%.*}"

    # A shell function reflects the exit status of its last command.
    # This function "succeeds" if the uptime_secs is between min and max.
    (( min < uptime_secs && max > uptime_secs ))
}

if uptime_in_bounds 300 3600; then
    sudo reboot  # or whatever
fi
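The "run that function periodically" part can be handled by a small retry wrapper. The following is a minimal sketch rather than a definitive implementation: `retry_until` is a hypothetical helper, and `succeeds_on_third_call` is a stand-in check used only so the example runs on any machine. In practice the command passed in would be the uptime check, for example `uptime_in_bounds 300 3600`.

```bash
#!/usr/bin/env bash

# retry_until ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]   (hypothetical helper)
# Runs COMMAND until it succeeds, pausing DELAY_SECONDS between tries.
# Returns 0 on the first success, 1 once ATTEMPTS tries have all failed.
retry_until() {
    local attempts="$1" delay="$2"
    shift 2
    local try
    for (( try = 1; try <= attempts; try++ )); do
        "$@" && return 0
        (( try < attempts )) && sleep "$delay"   # no pause after the last try
    done
    return 1
}

# Stand-in check: fails twice, then succeeds on the third call.
calls=0
succeeds_on_third_call() {
    (( ++calls >= 3 ))
}

if retry_until 5 1 succeeds_on_third_call; then
    echo "check passed after ${calls} tries"
else
    echo "gave up after ${calls} tries"
fi
```

Capping the number of attempts means a server that never comes back cannot block the script forever.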

Answer 2

General idea ... this will likely need some tweaking depending on how OP tracks servers, obtains uptimes, etc ...

# for a given set of servers, and assuming stored in variable ${server_list} ...

unset      passed failed                      # wipe arrays
declare -A passed failed                      # declare associative arrays

for server in ${server_list}                  # loop through list of servers
do
    failed["${server}"]=0                     # add to the failed[] array
done

while [[ "${#failed[@]}" -gt 0 ]]             # loop while number of servers in the failed[] array is greater than 0
do
    for server in "${!failed[@]}"             # loop through servers in the failed[] array
    do
        uptime=$( some_command_to_get_uptime_for_server "${server}" )

        # if uptime is within our 'success' window then move server to passed[] array

        if [[ "${uptime}" -gt 5 && "${uptime}" -lt 60 ]] 
        then
            passed["${server}"]="${uptime}"   # add to passed[] array
            unset "failed[${server}]"         # remove from failed[] array (quoted so the shell cannot glob the subscript)
        else
            failed["${server}"]="${uptime}"
        fi
    done

    # display current status (edit/remove as desired)

    printf "\n++++++++++++++ successful reboots\n"
    printf "%s\n" "${!passed[@]}" | sort -n

    printf "\n++++++++++++++ failed reboot\n"

    for server in "${!failed[@]}"
    do
        printf "%s - %s (mins)\n" "${server}" "${failed[${server}]}"
    done | sort -n

    printf "\n"

    sleep 30                            # adjust as necessary
done

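Notes:

this code would likely be part of a larger looping construct based on sets of servers (i.e., a new ${server_list})
if the list of servers is in another format (e.g., a file, another array, etc.), the for loop will need to be modified to populate the failed[] array correctly
OP will need to edit this to add the code that finds the uptime of a given ${server}
OP is (obviously) free to rename the variables/arrays as desired
OP will probably need to decide what to do if the while loop goes on for "too long"
if a new ${uptime} is not within the 5-60 minute range, OP can add an else block to perform some other action for the problematic ${server}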
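The last two notes, a loop that runs "too long" and an else block for servers outside the window, can be combined with a deadline based on bash's built-in `SECONDS` counter. This is a sketch under stated assumptions, not the answer's own code: `wait_for_group` and `get_uptime_minutes` are hypothetical names, and the uptime lookup is faked (one second of wall-clock time counts as one minute of uptime) so the example runs without any servers.

```bash
#!/usr/bin/env bash

# Fake uptime lookup (assumption, replace with the real command):
# server-a starts at 4 "minutes" and gains one per elapsed second,
# every other server stays at 0 to simulate a machine that never returns.
get_uptime_minutes() {
    case "$1" in
        server-a) echo $(( 4 + SECONDS )) ;;
        *)        echo 0 ;;
    esac
}

# wait_for_group TIMEOUT_SECONDS POLL_SECONDS SERVER...   (hypothetical)
# Returns 0 once every server reports 5 < uptime < 60, or 1 on timeout.
wait_for_group() {
    local timeout="$1" interval="$2"
    shift 2
    local -A pending=()
    local server uptime
    for server in "$@"; do pending["$server"]=0; done

    local deadline=$(( SECONDS + timeout ))
    while (( ${#pending[@]} > 0 )); do
        for server in "${!pending[@]}"; do
            uptime=$(get_uptime_minutes "$server")
            if (( uptime > 5 && uptime < 60 )); then
                unset "pending[$server]"          # inside the window: done
            else
                pending["$server"]="$uptime"      # else block: remember last value
            fi
        done
        (( ${#pending[@]} == 0 )) && break
        if (( SECONDS >= deadline )); then
            printf 'timed out waiting for: %s\n' "${!pending[*]}" >&2
            return 1
        fi
        sleep "$interval"
    done
    return 0
}

if wait_for_group 3 1 server-a server-b; then
    echo "group is ready"
else
    echo "group did not become ready; stopping here"
fi
```

Returning a non-zero status on timeout lets the outer loop over patch groups decide whether to stop, skip the group, or alert someone.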


Answer 1
2 2020-12-18 19:51:13

Answer 2
1 accepted 2020-12-18 20:33:00
