How can I automatically kill idle GCE instances based on CPU usage?

I'm running some slightly unreliable software on some instances in an instance group. The software is installed and run by a startup script, and most of the time it works without issue, but about 10% of the new instances run out of memory and crash due to some sort of memory leak in the software. I can't get this leak fixed myself, so in the meantime I've been checking the instances every few hours and killing any that show an idle CPU (the software normally consumes all available CPU power).

However, I'm using preemptible instances, which can be killed off and restarted at any time, so dead instances keep running whenever I'm not actively monitoring them. After a day of leaving things unattended, I usually see ~80-85% CPU usage in the dashboard; the remaining capacity is wasted.

Is there any automated way I can kill off these dead instances? Restarting them is already handled by the instance group.

The following worked for me. It's a bash script which uses the uptime UNIX command to check whether the 15-minute average load on the CPU is below a threshold, and automatically shuts down the system if this is true on ten consecutive checks. You need to run this within your VM instance.

Credit, and more detailed explanation: Rohit Rawat's blog.

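#!/bin/bash
# Power off the VM once the 15-minute load average has stayed below
# the threshold for ten consecutive checks (one check per minute).
threshold=0.4

count=0
while true
do

  # uptime ends with "load average: <1m>, <5m>, <15m>"; take the third value
  load=$(uptime | sed -e 's/.*load average: //g' | awk '{ print $3 }')
  # bc prints 1 if the comparison is true, 0 otherwise
  res=$(echo "$load < $threshold" | bc -l)
  if (( res ))
  then
    echo "Idling.."
    ((count+=1))
  else
    # reset so that only consecutive idle checks are counted
    count=0
  fi
  echo "Idle minutes count = $count"

  if (( count>=10 ))
  then
    echo Shutting down
    # wait a little bit more before actually pulling the plug
    sleep 300
    sudo poweroff
  fi

  sleep 60

done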
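As a variation on the script above, here is a minimal sketch that reads the 15-minute load average from /proc/loadavg instead of parsing uptime, and keeps the "consecutive idle checks" logic in a small function that can be tested on its own. It assumes a Linux VM; the function names read_load15 and next_idle_count are made up for this example, and the comparison uses awk rather than bc.

```shell
#!/bin/bash
# Print the 15-minute load average (third field of /proc/loadavg).
# The optional argument is a hypothetical override, useful for testing.
read_load15() {
  local file="${1:-/proc/loadavg}"
  local one five fifteen rest
  read -r one five fifteen rest < "$file"
  printf '%s\n' "$fifteen"
}

# Print the next idle counter value:
# count + 1 if load < threshold, otherwise 0 (the idle streak is broken).
next_idle_count() {
  local count="$1" load="$2" threshold="$3"
  if awk -v l="$load" -v t="$threshold" 'BEGIN { exit !(l < t) }'; then
    echo $((count + 1))
  else
    echo 0
  fi
}
```

Inside the loop this would be used as count=$(next_idle_count "$count" "$(read_load15)" "$threshold"), followed by the same count check and poweroff as in the script above.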

It seems like there are two parts to this question:

  1. Identifying dead instances.
  2. Killing off those instances.

In terms of identifying dead instances, one way to do this would be to have a separate management instance that does not run this software and that keeps tabs on the other instances. For example, it could periodically send a health request to each instance and mark instances that are non-responsive or that report abnormal CPU usage (in this case, an idle CPU) as unhealthy.

Once your management instance has identified the unhealthy instances that need to be reset, you should be able to reset them using the API (I'm guessing the reset command) or by running the same operation with the gcloud command-line tool.
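A rough sketch of what such a management loop could look like is below. Everything specific in it is an assumption for illustration: the group name my-group, the zone, and a /health endpoint on port 8080 that the software may not actually expose (any check that tells a live instance from a dead one could be swapped into is_healthy). It uses gcloud compute instance-groups managed recreate-instances so that the instance group replaces the dead VM; by default it only prints the command.

```shell
#!/bin/bash
# Hypothetical values: adjust GROUP and ZONE to your own setup.
GROUP="${GROUP:-my-group}"
ZONE="${ZONE:-us-central1-a}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = print the gcloud command instead of running it

# Succeeds if the instance answers an HTTP health request within 5 seconds.
# The port and path are assumptions; replace with whatever check fits.
is_healthy() {
  curl --silent --fail --max-time 5 "http://$1:8080/health" > /dev/null
}

# Ask the managed instance group to recreate one instance.
recreate_instance() {
  local cmd=(gcloud compute instance-groups managed recreate-instances "$GROUP"
             --instances="$1" --zone="$ZONE")
  if [ "$DRY_RUN" = "1" ]; then
    echo "${cmd[*]}"
  else
    "${cmd[@]}"
  fi
}

# Check every instance name passed as an argument.
reap_unhealthy() {
  local name
  for name in "$@"; do
    is_healthy "$name" || recreate_instance "$name"
  done
}
```

Calling reap_unhealthy with a list of instance names from cron every few minutes would approximate the manual check described in the question.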

This works without bc (which is not available in GCP's Container-Optimized OS), using viswajithiii's answer and this post: How can I replace 'bc' tool in my bash script?

It also appends the history list to file before poweroff. I set my threshold very low, but the load shows 0.00 even when I'm editing files from the command line. It might work better if the instance is normally under heavy load.

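#!/bin/bash
# threshold is in hundredths of a load unit, so 10 means a load of 0.10
threshold=10

count=0
while true
do

  load=$(uptime | sed -e 's/.*load average: //g' | awk '{ print $3 }')
  # scale the load to an integer so that [ ] can compare it without bc
  load2=$(awk -v a="$load" 'BEGIN {print a*100}')
  echo "$load2"
  if [ "$load2" -lt "$threshold" ]
  then
    echo "Idling.."
    ((count+=1))
  else
    # reset so that only consecutive idle checks are counted
    count=0
  fi
  echo "Idle minutes count = $count"

  if (( count>10 ))
  then
    echo Shutting down
    # wait a little bit more before actually pulling the plug
    sleep 300
    # flush shell history (only has an effect if history is enabled in this shell)
    history -a
    sudo poweroff
  fi

  sleep 60

done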

That's not working for my low CPU usage, but this seems to:

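#!/bin/bash
# threshold is in tenths of a percent of CPU time (see the * 1000 below)
threshold=1

count=0
while true
do

  # sample the aggregate "cpu" line of /proc/stat twice, one second apart;
  # busy = user + system ($2 + $4), total = busy + idle ($5)
  load=$(awk '{u=$2+$4; t=$2+$4+$5; if (NR==1){u1=u; t1=t;} else print ($2+$4-u1) * 1000 / (t-t1); }' <(grep 'cpu ' /proc/stat) <(sleep 1;grep 'cpu ' /proc/stat))
  # round to an integer so that [[ -lt ]] can compare it
  load2=$(printf "%.0f\n" "$load")
  echo "$load"
  echo "$load2"
  if [[ $load2 -lt $threshold ]]
  then
    echo "Idling.."
    ((count+=1))
  else
    # reset so that only consecutive idle checks are counted
    count=0
  fi
  echo "Idle minutes count = $count"

  if (( count>10 ))
  then
    echo Shutting down
    # wait a little bit more before actually pulling the plug
    sleep 300
    # flush shell history (only has an effect if history is enabled in this shell)
    history -a
    sudo poweroff
  fi

  sleep 60

done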

For some reason, it only works when both echo statements for the load are left in.

credits:

How to get overall CPU usage (e.g. 57%) on Linux

https://unix.stackexchange.com/questions/89712/how-to-convert-floating-point-number-to-integer
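The /proc/stat calculation in the script above is easier to follow when pulled out into a function. The sketch below is one way to do that, assuming Linux; the names cpu_percent and sample_cpu_percent are made up for this example, and it reports a plain percentage (* 100) rather than the tenths of a percent used above. Like the script, it counts only user and system time as busy.

```shell
#!/bin/bash
# Print the busy CPU percentage between two aggregate "cpu" lines
# taken from /proc/stat. Busy = user + system (fields 2 and 4);
# total = busy + idle (field 5).
cpu_percent() {
  printf '%s\n%s\n' "$1" "$2" | awk '
    NR == 1 { u1 = $2 + $4; t1 = u1 + $5 }
    NR == 2 { u2 = $2 + $4; t2 = u2 + $5
              if (t2 > t1) printf "%.0f\n", (u2 - u1) * 100 / (t2 - t1)
              else print 0 }'
}

# Take two samples one second apart and print the percentage.
sample_cpu_percent() {
  local a b
  a=$(grep '^cpu ' /proc/stat)
  sleep 1
  b=$(grep '^cpu ' /proc/stat)
  cpu_percent "$a" "$b"
}
```

In the loop, load2=$(sample_cpu_percent) would then replace the two lines that compute load and load2, with the threshold expressed in percent.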

FYI: according to "Google Cloud Platform: how to monitor memory usage of VM instances", the GCP monitoring agent is not available for N type instances.

Put this in a startup script in /etc/my_init.d and make it executable:

sudo mkdir /etc/my_init.d
sudo mv autooff.sh /etc/my_init.d/autooff.sh
sudo chmod 755 /etc/my_init.d/autooff.sh

Actually, that gets deleted. Instead, edit the instance and add a Custom Metadata entry with the key startup-script and the value #! /bin/bash \\n~./autooff.sh
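The same metadata entry can probably also be set from the command line rather than through the console. The sketch below wraps gcloud compute instances add-metadata with --metadata-from-file, which uploads the contents of a local file as the startup-script value. The function name set_startup_script and the instance and zone names in the usage line are made up for this example, and by default the function only prints the command.

```shell
#!/bin/bash
# Attach a local script as the startup-script metadata of an instance.
# With DRY_RUN=1 (the default) the gcloud command is printed, not run.
set_startup_script() {
  local instance="$1" zone="$2" script="$3"
  local cmd=(gcloud compute instances add-metadata "$instance"
             --zone="$zone" --metadata-from-file="startup-script=$script")
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "${cmd[*]}"
  else
    "${cmd[@]}"
  fi
}
```

For example, DRY_RUN=0 set_startup_script my-vm us-central1-a autooff.sh would upload autooff.sh for a hypothetical instance called my-vm.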

I wish I could add this as a comment on viswajithiii's answer, but I'm just shy of the reputation necessary to comment.

I found a static threshold inappropriate when using cloud VMs with varying numbers of CPUs, because the load averages reported by uptime scale with the number of CPUs, as discussed here.

My updated script adds two lines below the threshold assignment to scale the threshold by the number of CPUs. This allows me to set a percentage CPU utilization that works across VMs with different numbers of CPUs.

Otherwise, the script is the same as viswajithiii's.

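#!/bin/bash

# per-CPU threshold; scaled below by the number of CPUs
threshold=0.4
n_cpu=$( grep -c 'model name' /proc/cpuinfo )
threshold=$( echo "$n_cpu*$threshold" | bc )

count=0
while true
do

  # uptime ends with "load average: <1m>, <5m>, <15m>"; take the third value
  load=$(uptime | sed -e 's/.*load average: //g' | awk '{ print $3 }')
  # bc prints 1 if the comparison is true, 0 otherwise
  res=$(echo "$load < $threshold" | bc -l)
  if (( res ))
  then
    echo "Idling.."
    ((count+=1))
  else
    # reset so that only consecutive idle checks are counted
    count=0
  fi
  echo "Idle minutes count = $count"

  if (( count>=10 ))
  then
    echo Shutting down
    # wait a little bit more before actually pulling the plug
    sleep 300
    sudo poweroff
  fi

  sleep 60

done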
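An equivalent way to handle VMs of different sizes is to leave the threshold alone and divide the load by the CPU count instead. A minimal sketch, assuming nproc from GNU coreutils is available (it should also work on machines where /proc/cpuinfo has no 'model name' lines); the function name load_per_cpu is made up for this example:

```shell
#!/bin/bash
# Print the load average divided by the number of CPUs, to two decimals.
# The second argument defaults to the output of nproc.
load_per_cpu() {
  local load="$1" cpus="${2:-$(nproc)}"
  awk -v l="$load" -v n="$cpus" 'BEGIN { printf "%.2f\n", l / n }'
}
```

With this, a 15-minute load of 1.60 on a 4-CPU VM is reported as 0.40, which can be compared directly against the unscaled threshold.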
