Bash/AWS CLI: Trying to figure out how to validate an array of uptimes with 2 checks when rebooting groups of EC2 instances through AWS SSM

I have been struggling to figure out the best way to approach this problem in a bash script. I have a command that checks groups of servers for their uptime in minutes. I only want to continue on to the next group of reboots once all of the servers have been up for 5 minutes, but I also want to verify they haven't been up for over an hour, in case the reboot doesn't take.

I was originally trying to set up a while loop that would keep issuing the command to check uptimes and send the output into an array. I am trying to figure out how to loop through an array until all of its elements are greater than 5 and less than 60. I haven't even been successful with the first check of greater than 5. Is it even possible to continually write to an array and perform arithmetic checks against every value, so that all values must be greater than X, in a while loop? The number of servers putting their current uptime into the array varies per group, so the array won't always hold the same number of values.

Is an array even the proper way to do this? I'd provide examples of what I have tried so far, but it's a huge mess, and I think starting from scratch and just asking for input is the best place to begin.

Output of the command I am running to pull uptimes looks similar to the following:

1
2
1
4
3
2
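
For reference, the check described above (every value greater than 5 and less than 60, for any number of values) can be sketched in a few lines. This is only a minimal sketch, assuming the uptime command prints one whole number per line as shown; `all_in_window` and `get_uptimes` are hypothetical names used for illustration, and `get_uptimes` is a stand-in for the real command.

```shell
#!/bin/bash

# Hypothetical helper: succeeds only if every remaining argument is a
# whole number strictly greater than $1 and strictly less than $2.
all_in_window() {
    local min="$1" max="$2" value
    shift 2
    (( $# > 0 )) || return 1                      # no values yet counts as "not ready"
    for value in "$@"; do
        [[ "$value" =~ ^[0-9]+$ ]] || return 1    # reject empty or non-numeric output
        (( 10#$value > min && 10#$value < max )) || return 1
    done
    return 0
}

# Stand-in for the real uptime command (an assumption); one value per line.
get_uptimes() {
    printf '%s\n' 6 7 6 9 8 7
}

# mapfile reads however many lines the command prints, so the group
# size does not need to be known in advance.
mapfile -t uptimes < <(get_uptimes)

if all_in_window 5 60 "${uptimes[@]}"; then
    echo "all servers are inside the window"
else
    echo "not yet"
fi
```

Wrapping this in a `while ! all_in_window ...; do sleep 60; done` loop, with the `mapfile` line inside it, would re-read the uptimes on every pass.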

Edit

Thanks to the help provided, I was able to get a functional proof of concept together, and I'm stoked. Here it is in case it helps anyone trying to do something similar in the future. The problem at hand: we use AWS SSM for all of our Windows server patching, and when SSM tells servers to reboot after patching, the SSM Agent often takes ages to check back in. This slows down our entire process, which right now is fairly manual across dozens of patch groups. Many times we have to manually verify that a server did indeed reboot after we told it to from SSM, so that we know we can start the reboots for the next patch group. With this, we can run a single script that issues reboots for our patch groups in the proper order and verifies that the servers have rebooted before continuing on to the next group.

#!/bin/bash

### The purpose of this script is to automate the execution of commands required to reboot groups of AWS Windows servers utilizing SSM while also verifying their uptime and only continuing on to the next group once the previous has reached X # of minutes. This solves the problems of AWS SSM Agents not properly checking in with SSM post-reboot.

patchGroups=(01 02 03)                      # array containing the values of the RebootGroup tag


for group in "${patchGroups[@]}"
do
    printf "Rebooting Patch Group %q\n" "$group"
    # $(...) is left unquoted on purpose so that each instance ID becomes a separate argument
    aws ec2 reboot-instances --instance-ids $(aws ec2 describe-instances --filters "Name=tag:RebootGroup,Values=$group" --query 'Reservations[].Instances[].InstanceId' --output text)

    sleep 2m

    unset      passed failed                      # wipe arrays
    declare -A passed failed                      # declare associative arrays

    # serverList is a plain whitespace-separated string of instance IDs, so it is not declared as an associative array
    serverList=$(aws ec2 describe-instances --filters "Name=tag:RebootGroup,Values=$group" --query 'Reservations[*].Instances[*].[InstanceId]' --output text)

    for server in ${serverList}                  # loop through list of servers
    do
        failed["${server}"]=0                     # add to the failed[] array
    done

    while [[ "${#failed[@]}" -gt 0 ]]             # loop while number of servers in the failed[] array is greater than 0
    do
        for server in "${!failed[@]}"             # loop through servers in the failed[] array
        do
            ssmID=$(aws ssm send-command --document-name "AWS-RunPowerShellScript" --document-version "1" --targets "[{\"Key\":\"InstanceIds\",\"Values\":[\"$server\"]}]" --parameters '{"commands":["$wmi = Get-WmiObject -Class Win32_OperatingSystem ","$uptimeMinutes =    ($wmi.ConvertToDateTime($wmi.LocalDateTime)-$wmi.ConvertToDateTime($wmi.LastBootUpTime) | select-object -expandproperty \"TotalMinutes\")","[int]$uptimeMinutes"],"workingDirectory":[""],"executionTimeout":["3600"]}' --timeout-seconds 600 --max-concurrency    "50" --max-errors "0" --region us-west-2 --output text --query "Command.CommandId")

            sleep 5

            uptime=$(aws ssm list-command-invocations --command-id "$ssmID" --details --query 'CommandInvocations[].CommandPlugins[].Output' --output text | sed 's/\r$//')

            printf "Checking instance ID %q\n" "$server"
            printf "Value of uptime is = %q\n" "$uptime"

            # if uptime is within our 'success' window then move server to passed[] array

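            if [[ "${uptime}" -ge 3 && "${uptime}" -lt 60 ]]
            then
                passed["${server}"]="${uptime}"   # add to passed[] array
                printf "Server with instance ID %q has successfully rebooted.\n" "$server"
                unset "failed[${server}]"         # remove from failed[] array (quoted so the brackets are not treated as a glob)
            else
                failed["${server}"]="${uptime}"   # remember the latest uptime for the status report below
            fi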
        done

        # display current status (edit/remove as desired)

        printf "\n++++++++++++++ successful reboots\n"
        printf "%s\n" "${!passed[@]}" | sort -n

        printf "\n++++++++++++++ failed reboot\n"

        for server in "${!failed[@]}"
        do
            printf "%s - %s (mins)\n" "${server}" "${failed[${server}]}"
        done | sort -n

        printf "\n"

        sleep 60                            # adjust as necessary
    done
done
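
One thing that may be worth guarding against in the loop above: the `-ge` and `-lt` tests inside `[[ ]]` are arithmetic, so an empty `$uptime` (for example, when SSM has not returned output yet) is evaluated as 0, and free-form text such as an error message can raise an arithmetic error. A minimal sketch of a stricter check follows; `uptime_in_window` is a hypothetical helper name, and the 3 and 60 bounds simply mirror the script above.

```shell
#!/bin/bash

# Hypothetical helper: succeeds only when $1 is a whole number of minutes
# that is >= $2 and < $3. Empty strings and text always fail.
uptime_in_window() {
    local uptime="$1" min="$2" max="$3"
    [[ "$uptime" =~ ^[0-9]+$ ]] || return 1
    (( 10#$uptime >= min && 10#$uptime < max ))
}

# Try a few sample values, including the awkward ones.
for sample in 4 "" "access denied" 2 75; do
    if uptime_in_window "$sample" 3 60; then
        printf '%q -> rebooted\n' "$sample"
    else
        printf '%q -> keep waiting\n' "$sample"
    fi
done
```

In the script, the `if [[ ... ]]` test could then become `if uptime_in_window "$uptime" 3 60`.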

It sounds like you need to keep re-evaluating the output of uptime to get the data you need, so an array or other variable may just get you stuck. Think about this functionally (as in functions). You need a function that checks, just once, whether the uptime is within the bounds you want. Then you need to run that function periodically. If it succeeds, you trigger the reboot. If it fails, you let it try again later.

Consider this code:

uptime_in_bounds() {
    local min="$1"
    local max="$2"
    local uptime_float uptime_secs

    # The first value in /proc/uptime is the number of seconds the
    # system has been up. We have to truncate it to an integer…
    read -r uptime_float _ < /proc/uptime
    uptime_secs="${uptime_float%.*}"

    # A shell function reflects the exit status of its last command.
    # This function "succeeds" if the uptime_secs is between min and max.
    (( min < uptime_secs && max > uptime_secs ))
}

if uptime_in_bounds 300 3600; then
    sudo reboot  # or whatever
fi
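
The "run that function periodically" part could look something like the sketch below. `retry_until` is a hypothetical helper, not something from the code above; it takes a maximum number of attempts, a delay in seconds, and then the command to run. The demo uses a stub check with a zero delay so it finishes immediately.

```shell
#!/bin/bash

# Hypothetical helper: run a command repeatedly until it succeeds or the
# attempts run out.
#   $1 = maximum number of attempts
#   $2 = seconds to sleep between attempts
#   remaining arguments = the command (or function) to run
retry_until() {
    local attempts="$1" delay="$2" i
    shift 2
    for (( i = 1; i <= attempts; i++ )); do
        "$@" && return 0
        (( i < attempts )) && sleep "$delay"     # no sleep after the final attempt
    done
    return 1
}

# Stub check for the demo: fails twice, then succeeds on the third call.
calls=0
succeeds_on_third_call() {
    (( ++calls >= 3 ))
}

if retry_until 5 0 succeeds_on_third_call; then
    echo "succeeded after $calls calls"
else
    echo "gave up after $calls calls"
fi
```

With the function from this answer, a call might read `retry_until 20 30 uptime_in_bounds 300 3600`, which would poll every 30 seconds for up to 20 attempts.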

General idea; it will likely need some tweaking based on how OP is tracking servers, obtaining uptimes, etc.

# for a given set of servers, and assuming stored in variable ${server_list} ...

unset      passed failed                      # wipe arrays
declare -A passed failed                      # declare associative arrays

for server in ${server_list}                  # loop through list of servers
do
    failed["${server}"]=0                     # add to the failed[] array
done

while [[ "${#failed[@]}" -gt 0 ]]             # loop while number of servers in the failed[] array is greater than 0
do
    for server in "${!failed[@]}"             # loop through servers in the failed[] array
    do
        uptime=$( some_command_to_get_uptime_for_server "${server}" )

        # if uptime is within our 'success' window then move server to passed[] array

        if [[ "${uptime}" -gt 5 && "${uptime}" -lt 60 ]]
        then
            passed["${server}"]="${uptime}"   # add to passed[] array
            unset "failed[${server}]"         # remove from failed[] array (quoted so the brackets are not treated as a glob)
        else
            failed["${server}"]="${uptime}"
        fi
    done

    # display current status (edit/remove as desired)

    printf "\n++++++++++++++ successful reboots\n"
    printf "%s\n" "${!passed[@]}" | sort -n

    printf "\n++++++++++++++ failed reboot\n"

    for server in "${!failed[@]}"
    do
        printf "%s - %s (mins)\n" "${server}" "${failed[${server}]}"
    done | sort -n

    printf "\n"

    sleep 30                            # adjust as necessary
done

NOTES:

  • this code would likely be part of a larger looping construct based on sets of servers (ie, a new ${server_list} for each set)
  • if the list of servers is in another format (eg, file, another array, etc), OP will need to modify the for loop to properly populate the failed[] array
  • OP will need to add code for finding the uptime for a given ${server}
  • OP is (obviously) free to rename variables/arrays as desired
  • OP will probably need to decide what to do if the while loop continues 'too long'
  • if a new ${uptime} is not within the 5-60 min range, OP can extend the else block to perform some other operation(s) for the problematic ${server}
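
On the 'too long' point, one option is to put a deadline on the while loop using bash's built-in SECONDS counter. The following is only a sketch under assumptions: the instance IDs are made up, `check_server` is a stub standing in for the real uptime check, and the 2-second limit is there just so the demo ends quickly.

```shell
#!/bin/bash

# Made-up instance IDs, for illustration only.
declare -A failed=( [i-aaa]=0 [i-bbb]=0 )

max_wait=2          # give up after this many seconds (assumed value)
poll_interval=1     # seconds between passes (assumed value)

# Stub for the real uptime check: only i-aaa ever "passes" in this demo.
check_server() {
    [[ "$1" == "i-aaa" ]]
}

SECONDS=0           # bash increments this automatically once per second
while (( ${#failed[@]} > 0 )); do
    for server in "${!failed[@]}"; do
        if check_server "$server"; then
            unset "failed[$server]"     # server passed; stop tracking it
        fi
    done

    (( ${#failed[@]} == 0 )) && break   # everything passed

    if (( SECONDS >= max_wait )); then
        printf 'Timed out waiting for: %s\n' "${!failed[*]}" >&2
        break                           # or exit, alert, skip the group, etc.
    fi
    sleep "$poll_interval"
done
```

Whether a timeout should abort the whole run or just skip to the next group is a policy decision for OP; the sketch only shows where that decision would go.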

The technical posts on this site are licensed under CC BY-SA 4.0. If you reprint them, please cite the site URL or the original address. For any questions, contact yoyou2525@163.com.

 