
Is there a good way to detect a stale NFS mount?

I have a procedure I want to initiate only if several tests complete successfully.

One test I need is that all of my NFS mounts are alive and well.

Can I do better than the brute force approach:


mount | sed -n "s/^.* on \(.*\) type nfs .*$/\1/p" |
while read -r mount_point ; do
  # Quote the path so mount points containing spaces are handled correctly
  timeout 10 ls "$mount_point" > /dev/null 2>&1 || echo "stale $mount_point" ;
done

Here timeout is a utility that runs the command in the background and kills it if no SIGCHLD is caught before the time limit, returning success or failure accordingly.


In English: parse the output of mount and check every NFS mount point, with each check bounded by a timeout. Optionally (not in the code above), break on the first stale mount.
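For comparison, the same parse-then-probe idea can be sketched in Python. This is only a rough sketch under a few assumptions: a Linux-style mount table at /proc/mounts, NFS filesystems reported with type nfs or nfs4, and made-up helper names (nfs_mount_points, is_responsive, report_stale). The probe is killed but deliberately not waited on after a timeout, in case the child stays stuck.

```python
import subprocess

def nfs_mount_points(mounts_text):
    """Return the mount points of NFS filesystems in /proc/mounts-style text."""
    points = []
    for line in mounts_text.splitlines():
        fields = line.split()
        # Fields: device, mount point, filesystem type, options, dump, pass
        if len(fields) >= 3 and fields[2] in ("nfs", "nfs4"):
            points.append(fields[1])
    return points

def is_responsive(path, timeout=10.0, command=("ls",)):
    """Run `command path` in a child; False if it fails or exceeds timeout."""
    proc = subprocess.Popen([*command, path],
                            stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL)
    try:
        return proc.wait(timeout=timeout) == 0
    except subprocess.TimeoutExpired:
        proc.kill()  # send SIGKILL; not waited on, the child may be stuck
        return False

def report_stale(mounts_file="/proc/mounts"):
    """Print every NFS mount point whose probe failed or timed out."""
    with open(mounts_file) as f:
        for mount_point in nfs_mount_points(f.read()):
            if not is_responsive(mount_point):
                print("stale", mount_point)
```

Calling report_stale() prints one line per unresponsive mount. Note that /proc/mounts escapes spaces in paths as \040, which this sketch does not decode.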

A colleague of mine ran into your script. This doesn't avoid the "brute force" approach, but here is my take in Bash:

while read _ _ mount _; do 
  read -t1 < <(stat -t "$mount") || echo "$mount timeout"; 
done < <(mount -t nfs)

mount can list NFS mounts directly. read -t (a shell builtin) can time out a command. stat -t (terse output) still hangs on a stale mount, just as ls does (see the note on test -d below). ls, however, yields unnecessary output, risks false positives on huge or slow directory listings, and requires permission to access the directory, so a lack of permission would also trigger a false positive.

while read _ _ mount _; do 
  read -t1 < <(stat -t "$mount") || lsof -b 2>/dev/null|grep "$mount"; 
done < <(mount -t nfs)

We're using it with lsof -b (non-blocking, so it won't hang either) to determine the source of the hangs.

Thanks for the pointer!

  • test -d (a shell builtin) would work instead of stat (a standard external command) as well, but read -t returns success only if it reads a line of input before timing out. Since test -d writes nothing to stdout, a (( $? > 128 )) exit-status check would be necessary instead, which is not worth the legibility hit, IMO.

Took me some time, but here is what I found which works in Python:

import signal, subprocess

class Alarm(Exception):
    pass

def alarm_handler(signum, frame):
    raise Alarm

pathToNFSMount = '/mnt/server1/' # or you can implement some function 
                                 # to find all the mounts...

signal.signal(signal.SIGALRM, alarm_handler)
signal.alarm(3)  # 3 seconds
try:
    # Popen (unlike call) returns an object that has communicate();
    # passing a list avoids building a shell command from a string.
    proc = subprocess.Popen(['stat', pathToNFSMount], stderr=subprocess.PIPE, stdout=subprocess.PIPE)
    stdoutdata, stderrdata = proc.communicate()
    signal.alarm(0)  # reset the alarm
except Alarm:
    print("Oops, taking too long!")

Remarks:

  1. Credit to the answer here.
  2. You could also use an alternative scheme: call os.fork() and run os.stat() in the child. The parent then checks whether the child has finished and kills it if it has timed out. You will need to track the elapsed time with time.time() and so on.
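A minimal sketch of that fork-based scheme might look like the following. The function name stat_with_timeout, the probe parameter, and the 50 ms polling interval are my own choices for illustration, not something from the answer above.

```python
import os
import signal
import time

def stat_with_timeout(path, timeout=3.0, probe=os.stat):
    """Run probe(path) in a forked child; return 'ok', 'error' or 'timeout'."""
    pid = os.fork()
    if pid == 0:
        # Child: report the outcome through the exit status only.
        try:
            probe(path)
        except BaseException:
            os._exit(1)
        os._exit(0)

    # Parent: poll without blocking until the child exits or time runs out.
    deadline = time.time() + timeout
    while time.time() < deadline:
        done_pid, status = os.waitpid(pid, os.WNOHANG)
        if done_pid == pid:
            ok = os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0
            return "ok" if ok else "error"
        time.sleep(0.05)

    # Timed out: kill the child. It is not waited on here, because a child
    # stuck in the kernel may not die promptly; it may linger as a zombie
    # until this process exits.
    os.kill(pid, signal.SIGKILL)
    return "timeout"
```

For example, stat_with_timeout('/mnt/server1/') returns 'timeout' if the stat call has not finished within three seconds, and 'error' if it fails outright (for instance because the path does not exist).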

In addition to the previous answers, which hang under some circumstances, this snippet checks all suitable mounts, kills with signal KILL, and has been tested with CIFS too:

grep -v tracefs /proc/mounts | cut -d' ' -f2 | \
  while read -r m; do \
    timeout --signal=KILL 1 ls -d "$m" > /dev/null || echo "$m"; \
  done

You could write a C program and check for ESTALE.

#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>
#include <iso646.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

int main(){
    struct stat st;
    int ret;
    ret = stat("/mnt/some_stale", &st);
    if(ret == -1 and errno == ESTALE){
        printf("/mnt/some_stale is stale\n");
        return EXIT_SUCCESS;
    } else {
        return EXIT_FAILURE;
    }
}
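The same ESTALE check can be sketched in Python, where a failing os.stat raises OSError carrying the errno. The helper name is_stale and its stat_func parameter (there only so the check can be exercised without a real stale mount) are assumptions for illustration. Like the C version, it blocks for as long as the stat call itself blocks.

```python
import errno
import os

def is_stale(path, stat_func=os.stat):
    """Return True only if stat on path fails with ESTALE."""
    try:
        stat_func(path)
    except OSError as exc:
        # Any other error (ENOENT, EACCES, ...) is not a stale handle.
        return exc.errno == errno.ESTALE
    return False
```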

Writing a C program that checks for ESTALE is a good option if you don't mind the command blocking on the stale file system. If you want a timeout, the best way I've found to implement it (in a C program) is to fork a child process that tries to open a file on the file system. The parent then checks whether the child has finished successfully within an allotted amount of time.

Here is a small proof of concept C program to do this:

#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>
#include <unistd.h>
#include <errno.h>
#include <fcntl.h>
#include <signal.h>    /* kill() */
#include <sys/types.h>
#include <sys/wait.h>


void readFile();
void waitForChild(int pid);


int main(int argc, char *argv[])
{
  int pid;

  pid = fork();

  if(pid == 0) {
    // Child process.
    readFile();
  }
  else if(pid > 0) {
    // Parent process.
    waitForChild(pid);
  }
  else {
    // Error
    perror("Fork");
    exit(1);
  }

  return 0;
}

void waitForChild(int child_pid)
{
  int timeout = 2; // 2 seconds timeout.
  int status;
  int pid;

  while(timeout != 0) {
    pid = waitpid(child_pid, &status, WNOHANG);
    if(pid == 0) {
      // Still waiting for a child.
      sleep(1);
      timeout--;
    }
    else if(pid == -1) {
      // Error
      perror("waitpid()");
      exit(1);
    }
    else {
      // The child exited.
      if(WIFEXITED(status)) {
        // Child was able to call exit().
        if(WEXITSTATUS(status) == 0) {
          printf("File read successfully!\n");
          return;
        }
      }
      printf("File NOT read successfully.\n");
      return;
    }
  }

  // The child did not finish and the timeout was hit.
  kill(child_pid, 9);
  printf("Timeout reading the file!\n");
}

void readFile()
{
  int fd;

  fd = open("/path/to/a/file", O_RDONLY); // read-only is enough to probe the mount
  if(fd == -1) {
    // Error
    perror("open()");
    exit(1);
  }
  else {
    close(fd);
    exit(0);
  }
}

I wrote https://github.com/acdha/mountstatus, which uses an approach similar to what UndeadKernel mentioned and which I've found to be the most robust: it's a daemon that periodically scans all mounted filesystems by forking a child process that attempts to list the top-level directory, and sends the child SIGKILL if it fails to respond within a certain timeout. Both successes and failures are recorded to syslog. That avoids issues with certain client implementations (e.g. older Linux) that never trigger timeouts for certain classes of error, with NFS servers that are partially responsive but won't respond to actual calls like listdir, and so on.

I don't publish them but the included Makefile uses fpm to build rpm and deb packages with an Upstart script.
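To make the shape of such a periodic scan concrete, here is a rough Python sketch. It is not how mountstatus itself is implemented. The names scan_once and run_forever, the 60-second interval, and the check and list_mounts callables are all hypothetical, and it logs through the standard logging module rather than directly to syslog.

```python
import logging
import time

log = logging.getLogger("mount_scan")

def scan_once(mount_points, check):
    """Check every mount point once; log and return {mount_point: bool}."""
    results = {}
    for mount_point in mount_points:
        alive = bool(check(mount_point))
        results[mount_point] = alive
        if alive:
            log.info("mount %s is responding", mount_point)
        else:
            log.error("mount %s failed to respond", mount_point)
    return results

def run_forever(list_mounts, check, interval=60.0):
    """Rescan periodically, re-reading the mount list on every pass."""
    while True:
        scan_once(list_mounts(), check)
        time.sleep(interval)
```

Here check would be a probe that forks a child and enforces a timeout, and list_mounts a function returning the current mount points. Attaching a logging.handlers.SysLogHandler to the logger would route the records to syslog.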

Another way, using a shell script. It works well for me:

#!/bin/bash
# Purpose:
# Detect Stale File handle and remove it
# Script created: July 29, 2015 by Birgit Ducarroz
# Last modification: --
#

# Detect stale file handles and write the affected paths into a file
df 2>&1 | grep 'Stale file handle' | awk '{print $2}' > NFS_stales.txt
# Remove the : ‘ and ’ characters from the output
sed -r -i -e 's/://' -e 's/‘//' -e 's/’//' NFS_stales.txt

# Not used: replace space by a new line
# stales=`cat NFS_stales.txt && sed -r -i ':a;N;$!ba;s/ /\n /g' NFS_stales.txt`

# read NFS_stales.txt output file line by line then unmount stale by stale.
#    IFS='' (or IFS=) prevents leading/trailing whitespace from being trimmed.
#    -r prevents backslash escapes from being interpreted.
#    || [[ -n $line ]] prevents the last line from being ignored if it doesn't end with a \n (since read returns a non-zero exit code when it encounters EOF).

while IFS='' read -r line || [[ -n "$line" ]]; do
    echo "Unmounting due to NFS Stale file handle: $line"
    umount -fl "$line"
done < "NFS_stales.txt"
#EOF
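The parsing step of that script can also be sketched in Python. I am assuming here that df reports errors in the form df: ‘/mnt/data’: Stale file handle (with curly quotes, ASCII quotes, or none), which is what the sed cleanup above suggests. The message text depends on the locale, and the function name stale_paths is made up.

```python
import re

# Matches lines such as:  df: ‘/mnt/data’: Stale file handle
# The quotes around the path may be curly, ASCII, or missing.
_STALE_RE = re.compile(r"^df: [‘']?(.+?)[’']?: Stale file handle\s*$")

def stale_paths(df_output):
    """Extract the paths that df reported as stale from its combined output."""
    paths = []
    for line in df_output.splitlines():
        match = _STALE_RE.match(line)
        if match:
            paths.append(match.group(1))
    return paths
```

Unlike the awk '{print $2}' step, this keeps paths that contain spaces intact.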

I'll just paste a snippet from our Icinga2 NFS stale mount monitoring Bash script here (set_result and set_output are helper functions defined elsewhere in the script and not shown):

MOUNTS="$(mount -t nfs;mount -t nfs3;mount -t nfs4)"
MOUNT_POINTS=$(echo -e "$MOUNTS \n"|grep -v ^$|awk '{print $3}')

if [ -z "$MOUNT_POINTS" ]; then
        OUTPUT="[OK] No nfs mounts"
        set_result 0
else
        for i in $MOUNT_POINTS;do
                timeout 1 stat -t "$i" > /dev/null
                TMP_RESULT=$?
                set_result $TMP_RESULT
                set_output $TMP_RESULT "$i"
        done
fi
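Since set_result and set_output are not shown, here is a guess at what such an aggregation could look like, sketched in Python. It assumes the usual monitoring-plugin exit codes (0 for OK, 2 for CRITICAL) and that GNU timeout exits with status 124 when the command times out. The function name summarize and the message wording are hypothetical.

```python
OK, CRITICAL = 0, 2  # conventional monitoring-plugin exit codes

def summarize(results):
    """results maps mount point -> exit status of `timeout 1 stat -t <mount>`.

    Returns (plugin_exit_code, message).
    """
    if not results:
        return OK, "[OK] No nfs mounts"
    problems = []
    for mount_point, status in sorted(results.items()):
        if status == 124:  # GNU timeout: the command timed out
            problems.append(f"{mount_point} timed out")
        elif status != 0:
            problems.append(f"{mount_point} failed (exit {status})")
    if problems:
        return CRITICAL, "[CRITICAL] " + "; ".join(problems)
    return OK, f"[OK] {len(results)} nfs mounts responding"
```

A check script would print the message and exit with the returned code, so a single hung mount is enough to turn the whole check critical.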

