Using GNU parallel command with gfind to gain in runtime for gupdatedb tool

Question

This is a follow-up to my previous post: combine parallel and gfind

I would like to build the gupdatedb database so that it contains everything under the root directory / except the PRUNEPATHS listed further below. I am working on macOS 10.15 Catalina.

So I tried to modify the gupdatedb script to take advantage of the parallel command, like this (note the # : A2 part):

# : A2
cat | parallel -j32 $find {} $SEARCHPATHS $FINDOPTIONS \
    \( $prunefs_exp -type d -regex "$PRUNEREGEX" \) \
    -prune -o $print_option * :::

If I don't use cat | , I get the following warning message:

parallel: Warning: Input is read from the terminal. You are either an expert
parallel: Warning: (in which case: YOU ARE AWESOME!) or maybe you forgot
parallel: Warning: ::: or :::: or -a or to pipe data into parallel. If so
parallel: Warning: consider going through the tutorial: man parallel_tutorial
parallel: Warning: Press CTRL-D to exit.

and the process seems to hang.

Unfortunately, multiple $find (= gfind) processes don't seem to run at the same time:

I have launched the script like this: sudo time gupdatedb

and below is the output of ps aux | grep find :

root             84865   0.0  0.0  4459044  15828 s002  S+    1:43PM   0:00.10 perl /usr/local/bin/parallel -j32 /usr/local/Cellar/findutils/4.7.0/bin/gfind {} / ( -fstype 9P -o -fstype NFS -o -fstype afs -o -fstype autofs -o -fstype cifs -o -fstype coda -o -fstype devfs -o -fstype devpts -o -fstype ftpfs -o -fstype iso9660 -o -fstype mfs -o -fstype ncpfs -o -fstype nfs -o -fstype nfs4 -o -fstype proc -o -fstype shfs -o -fstype smbfs -o -fstype sysfs -o -type d -regex \(^/afs$\)\|\(^/amd$\)\|\(^/proc$\)\|\(^/sfs$\)\|\(^/tmp$\)\|\(^/usr/tmp$\)\|\(^/var/tmp$\)\|\(^/Volumes$\) ) -prune -o -print0 Applications Library System Users Volumes bin cores dev etc home opt private sbin tmp usr var :::
root             84863   0.0  0.0  4268280    796 s002  S+    1:43PM   0:00.00 /usr/local/Cellar/findutils/4.7.0/libexec/gfrcode -0
root             84861   0.0  0.0  4282172    708 s002  S+    1:43PM   0:00.00 /bin/sh /usr/local/Cellar/findutils/4.7.0/libexec/bin/gupdatedb
root             84853   0.0  0.0  4273980   1164 s002  S+    1:43PM   0:00.01 /bin/sh /usr/local/Cellar/findutils/4.7.0/libexec/bin/gupdatedb
root             84850   0.0  0.0  5396228  10288 s008  S+    1:43PM   0:00.27 vim /usr/local/Cellar/findutils/4.7.0/libexec/bin/gupdatedb
root             84849   0.0  0.0  4788896   6740 s008  S+    1:43PM   0:00.03 sudo vim /usr/local/Cellar/findutils/4.7.0/libexec/bin/gupdatedb

Finally, the database does not seem to be built: I am checking the size of /usr/local/var/locate/locatedb.n and /usr/local/var/locate/locatedb, but nothing changes.

What is wrong with the syntax I used with parallel? In particular, I don't know how to handle the ... ::: part of the command.

PS: I have set in gupdatedb :

# Directories to not put in the database, which would otherwise be.
: ${PRUNEPATHS="
/afs
/amd
/proc
/sfs
/tmp
/usr/tmp
/var/tmp
/Volumes
"}

and

# You can set these in the environment, or use command-line options,
# to override their defaults:

# Any global options for find?
: ${FINDOPTIONS=}

# What shell shoud we use?  We should use a POSIX-ish sh.
: ${SHELL="/bin/sh"}

# Non-network directories to put in the database.
: ${SEARCHPATHS="/"}

Update 1

To be more precise, here is a post where I asked about a potential optimization (parallelization) that combines parallel and find:

example of a potential parallelization with coupled parallel/find

I would like to apply the same optimization to the gupdatedb script.

Update 2

I followed the advice of:

The default command in gupdatedb that is relevant to my issue is:

$find $SEARCHPATHS $FINDOPTIONS \
 \( $prunefs_exp \
 -type d -regex "$PRUNEREGEX" \) -prune -o $print_option

So I modified it like this:

parallel -j32 $find {} $SEARCHPATHS $FINDOPTIONS \
    \( $prunefs_exp \
    -type d -regex "$PRUNEREGEX" \) -prune -o $print_option ::: /

and I get the following error:

/bin/sh: -c: line 0: syntax error near unexpected token `('
/bin/sh: -c: line 0: `/usr/local/Cellar/findutils/4.7.0/bin/gfind / / ( -fstype 9P -o -fstype NFS -o -fstype afs -o -fstype autofs -o -fstype cifs -o -fstype coda -o -fstype devfs -o -fstype devpts -o -fstype ftpfs -o -fstype iso9660 -o -fstype mfs -o -fstype ncpfs -o -fstype nfs -o -fstype nfs4 -o -fstype proc -o -fstype shfs -o -fstype smbfs -o -fstype sysfs -o -type d -regex \(^/private/tmp$\)\|\(^/private/var/folders$\)\|\(^/private/var/tmp$\)\|\(^*/Backups.backupdb$\)\|\(^/System$\)\|\(^/Volumes$\) ) -prune -o -print0'

What might be wrong here?

Update 3

Here is the gupdatedb script, where you can see my different attempts from line 300 onward:

#! /bin/sh
# updatedb -- build a locate pathname database
# Copyright (C) 1994-2019 Free Software Foundation, Inc.
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program.  If not, see <https://www.gnu.org/licenses/>.

# csh original by James Woods; sh conversion by David MacKenzie.

#exec 2> /tmp/updatedb-trace.txt
#set -x

version='
updatedb (GNU findutils) 4.7.0
Copyright (C) 1994-2019 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <https://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Written by Eric B. Decker, James Youngman, and Kevin Dalley.
'

# File path names are not actually text, anyway (since there is no
# mechanism to enforce any constraint that the basename of a
# subdirectory has the same character encoding as the basename of its
# parent).  The practical effect is that, depending on the way a
# particular system is configured and the content of its filesystem,
# passing all the file names in the system through "sort" may generate
# character encoding errors in text-based tools like "sort".  To avoid
# this, we set LC_ALL=C.  This will, presumably, not work perfectly on
# systems where LC_ALL is not the way to do locale configuration or
# some other seting can override this.
LC_ALL=C
export LC_ALL

# We can't use substitution on PACKAGE_URL below because it
# (correctly) points to https://www.gnu.org/software/findutils/ instead
# of the bug reporting page.
usage="\
Usage: $0 [--findoptions='-option1 -option2...']
       [--localpaths='dir1 dir2...'] [--netpaths='dir1 dir2...']
       [--prunepaths='dir1 dir2...'] [--prunefs='fs1 fs2...']
       [--output=dbfile] [--netuser=user] [--localuser=user]
       [--dbformat] [--version] [--help]

Please see also the documentation at http://www.gnu.org/software/findutils/.
Report (and track progress on fixing) bugs in the updatedb
program via the GNU findutils bug-reporting page at
https://savannah.gnu.org/bugs/?group=findutils or, if
you have no web access, by sending email to <bug-findutils@gnu.org>.
"
changeto=/

for arg
do
  # If we are unable to fork, the back-tick operator will
  # fail (and the shell will emit an error message).  When
  # this happens, we exit with error value 71 (EX_OSERR).
  # Alternative candidate - 75, EX_TEMPFAIL.
  opt=`echo $arg|sed 's/^\([^=]*\).*/\1/'`  || exit 71
  val=`echo $arg|sed 's/^[^=]*=\(.*\)/\1/'` || exit 71
  case "$opt" in
    --findoptions) FINDOPTIONS="$val" ;;
    --localpaths) SEARCHPATHS="$val" ;;
    --netpaths) NETPATHS="$val" ;;
    --prunepaths) PRUNEPATHS="$val" ;;
    --prunefs) PRUNEFS="$val" ;;
    --output) LOCATE_DB="$val" ;;
    --netuser) NETUSER="$val" ;;
    --localuser) LOCALUSER="$val" ;;
    --changecwd)  changeto="$val" ;;
    --dbformat)   dbformat="$val" ;;
    --version) fail=0; echo "$version" || fail=1; exit $fail ;;
    --help)    fail=0; echo "$usage"   || fail=1; exit $fail ;;
    *) echo "updatedb: invalid option $opt
Try '$0 --help' for more information." >&2
       exit 1 ;;
  esac
done

frcode_options=""
case "$dbformat" in
    "")
        # Default, use LOCATE02
        ;;
    LOCATE02)
        ;;
    slocate)
        frcode_options="$frcode_options -S 1"
        ;;
    *)
        # The "old" database format is no longer supported.
        echo "Unsupported locate database format ${dbformat}: Supported formats are:" >&2
        echo "LOCATE02, slocate" >&2
        exit 1
esac


if true
then
    sort="/usr/bin/sort -z"
    print_option="-print0"
    frcode_options="$frcode_options -0"
else
    sort="/usr/bin/sort"
    print_option="-print"
fi

getuid() {
    # format of "id" output is ...
    # uid=1(daemon) gid=1(other)
    # for `id's that don't understand -u
    id | cut -d'(' -f 1 | cut -d'=' -f2
}

# figure out if su supports the -s option
select_shell() {
    if su "$1" -s $SHELL -c false < /dev/null  ; then
    # No.
    echo ""
    else
    if su "$1" -s $SHELL -c true < /dev/null  ; then
        # Yes.
        echo "-s $SHELL"
        else
        # su is unconditionally failing.  We won't be able to
        # figure out what is wrong, so be conservative.
        echo ""
    fi
    fi
}


# You can set these in the environment, or use command-line options,
# to override their defaults:

# Any global options for find?
: ${FINDOPTIONS="-mindepth 1 -maxdepth 1"}
#: ${FINDOPTIONS=""}

# What shell shoud we use?  We should use a POSIX-ish sh.
: ${SHELL="/bin/sh"}

# Non-network directories to put in the database.
: ${SEARCHPATHS="/"}

# Network (NFS, AFS, RFS, etc.) directories to put in the database.
: ${NETPATHS=}

# Directories to not put in the database, which would otherwise be.
: ${PRUNEPATHS="
/afs
/amd
/proc
/sfs
/tmp
/usr/tmp
/var/tmp
"}

# Trailing slashes result in regex items that are never matched, which
# is not what the user will expect.   Therefore we now reject such
# constructs.
for p in $PRUNEPATHS; do
    case "$p" in
    /*/)   echo "$0: $p: pruned paths should not contain trailing slashes" >&2
           exit 1
    esac
done

# The same, in the form of a regex that find can use.
test -z "$PRUNEREGEX" &&
  PRUNEREGEX=`echo $PRUNEPATHS|sed -e 's,^,\\\(^,' -e 's, ,$\\\)\\\|\\\(^,g' -e 's,$,$\\\),'`

# The database file to build.
: ${LOCATE_DB=/usr/local/var/locate/locatedb}

# Directory to hold intermediate files.
if test -z "$TMPDIR"; then
  if test -d /var/tmp; then
    : ${TMPDIR=/var/tmp}
  elif test -d /usr/tmp; then
    : ${TMPDIR=/usr/tmp}
  else
    : ${TMPDIR=/tmp}
  fi
fi
export TMPDIR

# The user to search network directories as.
: ${NETUSER=daemon}

# The directory containing the subprograms.
if test -n "$LIBEXECDIR" ; then
    : LIBEXECDIR already set, do nothing
else
    : ${LIBEXECDIR=/usr/local/Cellar/findutils/4.7.0/libexec}
fi

# The directory containing find.
if test -n "$BINDIR" ; then
    : BINDIR already set, do nothing
else
    : ${BINDIR=/usr/local/Cellar/findutils/4.7.0/bin}
fi

# The names of the utilities to run to build the database.
: ${find:=${BINDIR}/gfind}
: ${frcode:=${LIBEXECDIR}/gfrcode}

make_tempdir () {
    # This implementation is adapted from the GNU Autoconf manual.
    {
        tmp=`
    (umask 077 && mktemp -d "$TMPDIR/updatedbXXXXXX") 2>/dev/null
    ` &&
        test -n "$tmp" && test -d "$tmp"
    } || {
    # This method is less secure than mktemp -d, but it's a fallback.
    #
    # We use $$ as well as $RANDOM since $RANDOM may not be available.
    # We also add a time-dependent suffix.  This is actually somewhat
    # predictable, but then so is $$.  POSIX does not require date to
    # support +%N.
    ts=`date +%N%S || date +%S 2>/dev/null`
        tmp="$TMPDIR"/updatedb"$$"-"${RANDOM:-}${ts}"
        (umask 077 && mkdir "$tmp")
    }
    echo "$tmp"
}

checkbinary () {
    if test -x "$1" ; then
    : ok
    else
      eval echo "updatedb needs to be able to execute $1, but cannot." >&2
      exit 1
    fi
}

for binary in $find $frcode
do
  checkbinary $binary
done


: ${PRUNEFS="
9P
NFS
afs
autofs
cifs
coda
devfs
devpts
ftpfs
iso9660
mfs
ncpfs
nfs
nfs4
proc
shfs
smbfs
sysfs
"}

if test -n "$PRUNEFS"; then
prunefs_exp=`echo $PRUNEFS |sed -e 's/\([^ ][^ ]*\)/-o -fstype \1/g' \
 -e 's/-o //' -e 's/$/ -o/'`
else
  prunefs_exp=''
fi

# Make and code the file list.
# Sort case insensitively for users' convenience.

rm -f $LOCATE_DB.n
trap 'rm -f $LOCATE_DB.n; exit' HUP TERM

if {
cd "$changeto"
if test -n "$SEARCHPATHS"; then
  if [ "$LOCALUSER" != "" ]; then
    # : A1
    su $LOCALUSER `select_shell $LOCALUSER` -c \
    "$find $SEARCHPATHS $FINDOPTIONS \
     \\( $prunefs_exp \
     -type d -regex '$PRUNEREGEX' \\) -prune -o $print_option"
  else
    # : A2
    # ORIGINAL VERSION : sequential find
    #$find $SEARCHPATHS $FINDOPTIONS \
    # \( $prunefs_exp \
    # -type d -regex "$PRUNEREGEX" \) -prune -o $print_option ::: /

    # Parallel version 1
    #parallel -j 32 $find $SEARCHPATHS $FINDOPTIONS \
    # \( $prunefs_exp \
    # -type d -regex "$PRUNEREGEX" \) -prune -o $print_option ::: /
    
    # Parallel version 2
    parallel -j 32 $find {} $FINDOPTIONS \
    $prunefs_exp -type d -regex $PRUNEREGEX -prune -o $print_option ::: */*
  fi
fi

if test -n "$NETPATHS"; then
myuid=`getuid`
if [ "$myuid" = 0 ]; then
    # : A3
    su $NETUSER `select_shell $NETUSER` -c \
     "$find $NETPATHS $FINDOPTIONS \\( -type d -regex '$PRUNEREGEX' -prune \\) -o $print_option" ||
    exit $?
  else
    # : A4
    $find $NETPATHS $FINDOPTIONS \( -type d -regex "$PRUNEREGEX" -prune \) -o $print_option ||
    exit $?
  fi
fi
} | $sort | $frcode $frcode_options > $LOCATE_DB.n
then
    : OK so far
    true
else
    rv=$?
    echo "Failed to generate $LOCATE_DB.n" >&2
    rm -f $LOCATE_DB.n
    exit $rv
fi

# To avoid breaking locate while this script is running, put the
# results in a temp file, then rename it atomically.
if test -s $LOCATE_DB.n; then
  chmod 644 ${LOCATE_DB}.n
  mv ${LOCATE_DB}.n $LOCATE_DB
else
  echo "updatedb: new database would be empty" >&2
  rm -f $LOCATE_DB.n
fi

exit 0

I launch the gupdatedb command like this:

sudo gupdatedb --prunepaths='/private/tmp /private/var/folders /private/var/tmp */Backups.backupdb /System /Volumes' --localpaths='/' --output=$HOME/locatedb_gupdatedb_PARALLEL

Update 4

My bounty expires tomorrow. With the default gupdatedb, the whole indexing run takes about 30 minutes. If I managed to use parallel correctly in the core of the gupdatedb script, that is, where it indexes with the gfind command, what speedup factor could I expect?

One last request: how can I fix this error:

/bin/sh: -c: line 0: syntax error near unexpected token `('
/bin/sh: -c: line 0: `/usr/local/Cellar/findutils/4.7.0/bin/gfind / / ( -fstype 9P -o -fstype NFS -o -fstype afs -o -fstype autofs -o -fstype cifs -o -fstype coda -o -fstype devfs -o -fstype devpts -o -fstype ftpfs -o -fstype iso9660 -o -fstype mfs -o -fstype ncpfs -o -fstype nfs -o -fstype nfs4 -o -fstype proc -o -fstype shfs -o -fstype smbfs -o -fstype sysfs -o -type d -regex \(^/private/tmp$\)\|\(^/private/var/folders$\)\|\(^/private/var/tmp$\)\|\(^*/Backups.backupdb$\)\|\(^/System$\)\|\(^/Volumes$\) ) -prune -o -print0'

with the command:

parallel -j32 $find {} $FINDOPTIONS \
    \( $prunefs_exp \
    -type d -regex "$PRUNEREGEX" \) -prune -o $print_option ::: /

?

Answer 1

You don't need ::: if there's nothing after it, and {} is pointless too if you don't have any sources. Without more information about what exactly you would want to parallelize, we can't really tell you what you should use instead.

But for example, if you want to run one find in each of /etc , /usr , /bin , and /opt , that would look like

parallel find {} -options ::: /etc /usr /bin /opt

This could equivalently be expressed without ::: :

printf '%s\n' /etc /usr /bin /opt |
parallel find {} -options

So the purpose of ::: is basically to say "I want to specify the things to parallelize over on the command line instead of receiving them on standard input"; but if you don't provide this information, either way, parallel doesn't know what to replace {} with.
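
As a rough illustration of the two input styles described above, the following sketch uses GNU parallel's --dry-run option, which prints each job instead of running it, to show what {} expands to. It assumes GNU parallel is installed (the comparison is skipped otherwise); -options is the same placeholder used above, not a real find flag, and the variable names are made up for this example.

```shell
#!/bin/sh
# Sketch: compare the jobs GNU parallel builds from ::: arguments
# with the jobs it builds from standard input.
from_args=""
from_stdin=""

# Guard: only run if the installed "parallel" is GNU parallel.
if parallel --version 2>/dev/null | grep -q 'GNU parallel'; then
    # --dry-run prints the commands; -k keeps output in input order.
    from_args=$(parallel --dry-run -k find {} -options ::: /etc /usr /bin /opt)
    from_stdin=$(printf '%s\n' /etc /usr /bin /opt |
        parallel --dry-run -k find {} -options)
fi

if [ "$from_args" = "$from_stdin" ]; then
    echo "both forms produce the same job list"
fi
printf '%s\n' "$from_args"
```

Each line of the printed job list should be one find command with {} replaced by a single directory, which is a quick way to check a parallel command line before running it for real.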

I'm not saying this particular use makes sense for your use case, just hopefully clarifying the documentation ( again ).
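
Regarding the syntax error quoted in Update 2 and Update 4: GNU parallel passes the assembled command line to a shell, so the backslashes in \( and \) are consumed by the shell that runs gupdatedb, and the second shell then sees bare parentheses. The sketch below reproduces that double parsing with plain sh -c rather than with parallel itself; echo and the shortened argument list are simplified stand-ins for the real gfind invocation.

```shell
#!/bin/sh
# Sketch: a command string parsed by a second shell, as happens when
# GNU parallel runs a job. "echo" stands in for gfind here.

# The calling shell already removed the backslashes: the inner shell
# sees a bare "(" and reports a syntax error.
if sh -c 'echo ( -type d ) -prune' 2>/dev/null; then
    bare_result="ok"
else
    bare_result="syntax error"
fi

# One extra level of quoting survives the first parse, so the inner
# shell receives \( and \) and passes literal parentheses on.
escaped_output=$(sh -c 'echo \( -type d \) -prune')

echo "bare parentheses:    $bare_result"
echo "escaped parentheses: $escaped_output"
```

With parallel, the same idea would mean protecting the parentheses and the regex with one more layer of quoting, or trying parallel's -q option, which its man page describes for commands containing characters the shell should not interpret. Neither has been tested against this particular script.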

Answer 2

To get any meaningful speedup from using parallel, you need to make sure that you have spare resources that can make the process faster. There are two challenges here:

The updatedb process is I/O bound. Usually, you use parallel to take advantage of a multi-core system and spread a CPU-bound process over multiple cores.
The updatedb process requires exclusive access to the database (usually /var/lib/mlocate/mlocate.db). Even if you get some benefit from splitting updatedb over multiple cores, you will have to write the output to multiple databases. This approach requires passing all the database names (separated with ':') to locate with '-d'.

Unless your system has multiple disk drives (or you are accessing network drives), you will gain very little from running find in parallel.
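
One way to judge whether the directories you want to index live on separate file systems is to compare the device that df reports for each of them. This is only a sketch: device_of and same_device are hypothetical helper names, and separate file systems can still share one physical disk, so treat the result as a hint rather than a guarantee.

```shell
#!/bin/sh
# Sketch: find out which device backs a path, using POSIX df output
# (second line, first column is the file system / device name).
device_of() {
    df -P "$1" | awk 'NR == 2 { print $1 }'
}

# Succeeds when both paths are reported on the same device.
same_device() {
    [ "$(device_of "$1")" = "$(device_of "$2")" ]
}

# Example: print the device for a couple of paths.
for dir in / .; do
    printf '%s -> %s\n' "$dir" "$(device_of "$dir")"
done
```

If every directory maps to the same device, parallel find processes will mostly compete for the same disk.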

If your system has multiple disk drives (and/or network drives), you can index each file system in parallel with a script like the following, assuming you have two additional disks mounted on /mnt/disk1 and /mnt/disk2:

# Index root
updatedb --output=/var/lib/mlocate/local.db -E '/mnt/disk1 /mnt/disk2' &
# Index 1st extra disk (or network drive)
updatedb --output=/var/lib/mlocate/disk1.db -U /mnt/disk1 &
# Index 2nd extra disk (or network drive)
updatedb --output=/var/lib/mlocate/disk2.db -U /mnt/disk2 &
wait
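
A bare wait, as used above, blocks until every background job has finished but does not tell you which of them failed. The following sketch waits on each process ID separately and counts the failures; run_all and the failures variable are names invented for this example, and the stand-in commands would be replaced by the updatedb invocations above.

```shell
#!/bin/sh
# Sketch: start several commands in the background, then wait on each
# PID so that individual failures are not lost.
run_all() {
    failures=0
    pids=""
    for cmd in "$@"; do
        sh -c "$cmd" &          # one background job per command string
        pids="$pids $!"         # remember its process ID
    done
    for pid in $pids; do
        wait "$pid" || failures=$((failures + 1))
    done
}

# Stand-in commands; replace them with the real indexing commands.
run_all 'sleep 1' 'true' 'false'
echo "jobs that failed: $failures"
```

Checking the count before switching locate over to the new databases avoids silently using a partial index.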

You should set the environment variable LOCATE_PATH to point to all the databases:

export LOCATE_PATH=/var/lib/mlocate/local.db:/var/lib/mlocate/disk1.db:/var/lib/mlocate/disk2.db
locate ...
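
If the list of databases grows, the colon-separated value can be assembled from the individual file names instead of being typed by hand. A minimal sketch, assuming a POSIX shell; join_with_colon is a hypothetical helper, and the paths are the ones from the example above.

```shell
#!/bin/sh
# Sketch: join arguments with ":" using the shell's "$*" expansion,
# which separates arguments with the first character of IFS.
join_with_colon() {
    (
        IFS=:
        printf '%s\n' "$*"
    )
}

LOCATE_PATH=$(join_with_colon \
    /var/lib/mlocate/local.db \
    /var/lib/mlocate/disk1.db \
    /var/lib/mlocate/disk2.db)
export LOCATE_PATH
echo "$LOCATE_PATH"
```

The subshell keeps the IFS change local, so the rest of the script is not affected.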
