
Very slow script

I have a problem. I need to write a bash script that will find all files and directories in a given path and display some info about the results. Allowed time: 30 seconds.

#!/bin/bash

DIRS=0
FILES=0
OLD_FILES=0
LARGE_FILES=0
TMP_FILES=0
EXE_FILES=0
IMG_FILES=0
SYM_LINKS=0
TOTAL_BYTES=0

#YEAR_AGO=$(date -d "now - 1 year" +%s)
#SECONDS_IN_YEAR=31536000

function check_dir {
    for entry in "$1"/*
    do
        if [ -d "$entry" ]; then
            ((DIRS+=1))
            check_dir "$entry"
        elif [ -f "$entry" ]; then
            ((FILES+=1))
            #SIZE=$(stat -c%s "$entry")
            #((TOTAL_BYTES+=SIZE))
            #CREATE_DATE=$(date -r "$entry" +%s)
            #CREATE_DATE=$(stat -c%W "$entry")
            #DIFF=$((CREATE_DATE-YEAR_AGO))
            #if [ $DIFF -ge $SECONDS_IN_YEAR ]; then
            #   ((OLD_FILES+=1))
            #fi
        fi
    done
}

if [ $# -ne 2 ]; then
    echo "Usage: ./srpt path emailaddress"
    exit 1
fi

if [ ! -d "$1" ]; then
    echo "Provided path is invalid"
    exit 1
fi

check_dir "$1"

echo "Execution time $SECONDS"
echo "Directories $DIRS"
echo "Files $FILES"
echo "Sym links $SYM_LINKS"
echo "Old files $OLD_FILES"
echo "Large files $LARGE_FILES"
echo "Graphics files $IMG_FILES"
echo "Temporary files $TMP_FILES"
echo "Executable files $EXE_FILES"
echo "Total file size $TOTAL_BYTES"

Here are the results of running it with the lines above commented out:

Execution time 1
Directories 931
Files 14515
Sym links 0
Old files 0
Large files 0
Graphics files 0
Temporary files 0
Executable files 0
Total file size 0

If I uncomment

SIZE=$(stat -c%s "$entry")
((TOTAL_BYTES+=SIZE))

I get:

Execution time 31
Directories 931
Files 14515
Sym links 0
Old files 0
Large files 0
Graphics files 0
Temporary files 0
Executable files 0
Total file size 447297022

31 seconds. How can I speed up my script? Finding files created more than one year ago adds another 30 seconds.

More often than not, using loops in shells is an indication that you're going for the wrong approach.

A shell is before all a tool to run other tools.

Though it can do counting, awk is a better tool to do it.

Though it can list and find files, find is better at it.

The best shell scripts are those that manage to have a few tools contribute to the task, not those that start millions of tools in sequence and where all the job is done by the shell.
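A minimal illustration of the point (using a hypothetical throwaway test tree): counting files with a shell loop versus letting find and wc do the work. Both give the same answer; the difference is who does the iterating.

```shell
# Hypothetical test tree: three empty files in a temp directory.
dir=$(mktemp -d)
touch "$dir"/a "$dir"/b "$dir"/c
# shell-loop way: the shell iterates and counts itself
n=0
for f in "$dir"/*; do
    [ -f "$f" ] && n=$((n+1))
done
# tool way: find emits one line per file, wc counts the lines
m=$(find "$dir" -type f | wc -l)
echo "$n $m"
```

On three files the two are indistinguishable; on tens of thousands, the single find traversal pulls far ahead of anything that forks per entry.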

Here, typically, a better approach would be to have find find the files and gather all the data you need, and have awk munch it and return the statistics. Here using GNU find and GNU awk (for RS='\0') and GNU date (for -d):

find . -printf '%y.%s.%Ts%p\0' |
  awk -v RS='\0' -F'[.]' -v yearago="$(date -d '1 year ago' +%s)" '
    {
      type[$1]++; 
      if ($1 == "f") {
        total_size+=$2
        if ($3 < yearago) old++
        if (!index($NF, "/")) ext[tolower($NF)]++
      }
    }
    END {
      printf("%20s: %d\n", "Directories", type["d"])
      printf("%20s: %d\n", "Total size", total_size)
      printf("%20s: %d\n", "old", old)
      printf("%20s: %d\n", "jpeg", ext["jpg"]+ext["jpeg"])
      printf("%20s: %d\n", "and so on...", 0)
    }'

The key is to avoid firing up too many utilities. You seem to be invoking two or three per file, which will be quite slow.

Also, the comments show that handling filenames, in general, is complicated, particularly if the filenames might have spaces and/or newlines in them. But you don't actually need the filenames, if I understand your problem correctly, since you are only using them to collect information.
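A sketch of that pitfall, with hypothetical file names: a filename containing a newline produces two "lines" in find's default output, while NUL-delimited records (find's -print0) stay unambiguous, because NUL is the one byte that cannot appear in a path.

```shell
# Create two files, one with an embedded newline in its name.
dir=$(mktemp -d)
touch "$dir/plain.txt" "$dir"/$'with\nnewline.txt'
lines=$(find "$dir" -type f | wc -l)                          # 3 lines for 2 files
records=$(find "$dir" -type f -print0 | tr -cd '\0' | wc -c)  # 2 NUL-terminated records
echo "$lines $records"
```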

If you're using GNU find, you can extract the stat information directly from find, which will be quite a lot more efficient, since find needs to do a stat() anyway on every file. Here's an example, which pipes from find into awk for simplicity:

summary() {
  find "$@" '(' -type f -o -type d ')' -printf '%y %s %C@\n' |
  awk '$1=="d"{DIR+=1;next}
       $1!="f"{next}
       {REG+=1;SIZE+=$2}
       $3<'$(date +%s -d"last year")'{OLD+=1}
       END{printf "Directories: %d\nFiles: %d\nOld files: %d\nTotal Size: %d\n",
                  DIR, REG, OLD, SIZE}'
}

On my machine, that summarised 28718 files in 4817 directories in one-tenth of a second elapsed time. YMMV.

You surely want to avoid parsing the output of find as you did: it'll break whenever you have spaces in filenames.

You surely want to avoid forking to external processes like your $(stat ...) or $(date ...) statements: each fork costs a lot!
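A rough sketch of that cost (assuming GNU stat and find, on a hypothetical temp directory): both approaches compute the same total size, but the loop forks one stat process per file, while the pipeline forks find and awk exactly once each, regardless of file count.

```shell
# Two files of known size, 10 and 20 bytes.
dir=$(mktemp -d)
printf '0123456789' > "$dir/a"
printf '01234567890123456789' > "$dir/b"
# per-file forking, as in the original script
slow=0
for f in "$dir"/*; do
    slow=$(( slow + $(stat -c%s "$f") ))   # one stat fork per file
done
# two forks total, no matter how many files
fast=$(find "$dir" -type f -printf '%s\n' | awk '{s+=$1} END{print s}')
echo "$slow $fast"
```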

It turns out that find on its own can do quite a lot. For example, if we want to count the numbers of files, dirs and links.

We all know the naive way (pretty much what you've done):

#!/bin/bash

shopt -s globstar
shopt -s nullglob
shopt -s dotglob
nbfiles=0
nbdirs=0
for f in ./**; do
    [[ -f $f ]] && ((++nbfiles))
    [[ -d $f ]] && ((++nbdirs))
done
echo "There are $nbdirs directories and $nbfiles files, and we're very happy."

Caveat. This method counts links according to what they link to: a link to a file will be counted as a file.
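The caveat can be seen directly (hypothetical temp tree): the `[[ -f ]]` test follows the symlink and reports its target, whereas find's -type l matches the link itself.

```shell
# One regular file plus a symlink pointing at it.
dir=$(mktemp -d)
touch "$dir/real"
ln -s real "$dir/link"
[[ -f "$dir/link" ]] && glob_says=file   # -f follows the link to its target
links=$(find "$dir" -type l | wc -l)     # find matches the link itself
echo "$glob_says $links"
```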

How about the find way? Count number of files, directories and (symbolic) links:

#!/bin/bash

nbfiles=0
nbdirs=0
nblinks=0
while read t n; do
    case $t in
    dirs) ((nbdirs+=n+1)) ;;
    files) ((nbfiles+=n+1)) ;;
    links) ((nblinks+=n+1)) ;;
    esac
done < <(
    # Note: bash -c takes the first path of each batch as $0, so each batch
    # reports $# = count-1; the n+1 in the loop above compensates, once per batch.
    find . -type d -exec bash -c 'echo "dirs $#"' {} + \
         -or -type f -exec bash -c 'echo "files $#"' {} + \
         -or -type l -exec bash -c 'echo "links $#"' {} + 2> /dev/null
)
echo "There are $nbfiles files, $nbdirs dirs and $nblinks links. You're happy to know aren't you?"

Same principles, using associative arrays, more fields and more involved find logic:

#!/bin/bash

declare -A fields

while read f n; do
    ((fields[$f]+=n))
done < <(
    find . -type d -exec bash -c 'echo "dirs $(($#+1))"' {} + \
        -or -type f -exec bash -c 'echo "files $(($#+1))"' {} + -printf 'size %s\n' \
            \( \
                \( -iname '*.jpg' -printf 'jpg 1\n' -printf 'jpg_size %s\n' \) \
                -or -size +100M -printf 'large 1\n' \
            \) \
        -or -type l -exec bash -c 'echo "links $(($#+1))"' {} + 2> /dev/null
)

for f in "${!fields[@]}"; do
    printf "%s: %s\n" "$f" "${fields[$f]}"
done

I hope this will give you some ideas! Good luck!
