Using the GNU parallel command with gfind to reduce the runtime of the gupdatedb tool

Question

This is a follow-up to my previous post, "combine parallel and gfind".

I would like to build the gupdatedb database so that it contains everything under the root directory / except the PRUNEPATHS listed further below. I am working on macOS 10.15 Catalina.

So I tried to modify the gupdatedb script on macOS 10.15 to take advantage of the parallel command, like this (note the # : A2 part):

# : A2
cat | parallel -j32 $find {} $SEARCHPATHS $FINDOPTIONS \
    \( $prunefs_exp -type d -regex "$PRUNEREGEX" \) \
    -prune -o $print_option * :::

If I don't use cat |, I get the following warning message:

parallel: Warning: Input is read from the terminal. You are either an expert
parallel: Warning: (in which case: YOU ARE AWESOME!) or maybe you forgot
parallel: Warning: ::: or :::: or -a or to pipe data into parallel. If so
parallel: Warning: consider going through the tutorial: man parallel_tutorial
parallel: Warning: Press CTRL-D to exit.

and the process seems to hang.

Unfortunately, multiple $find (= gfind) processes don't seem to run at the same time.

I launched the script like this: sudo time gupdatedb

and below is the output of ps aux | grep find:

root             84865   0.0  0.0  4459044  15828 s002  S+    1:43PM   0:00.10 perl /usr/local/bin/parallel -j32 /usr/local/Cellar/findutils/4.7.0/bin/gfind {} / ( -fstype 9P -o -fstype NFS -o -fstype afs -o -fstype autofs -o -fstype cifs -o -fstype coda -o -fstype devfs -o -fstype devpts -o -fstype ftpfs -o -fstype iso9660 -o -fstype mfs -o -fstype ncpfs -o -fstype nfs -o -fstype nfs4 -o -fstype proc -o -fstype shfs -o -fstype smbfs -o -fstype sysfs -o -type d -regex \(^/afs$\)\|\(^/amd$\)\|\(^/proc$\)\|\(^/sfs$\)\|\(^/tmp$\)\|\(^/usr/tmp$\)\|\(^/var/tmp$\)\|\(^/Volumes$\) ) -prune -o -print0 Applications Library System Users Volumes bin cores dev etc home opt private sbin tmp usr var :::
root             84863   0.0  0.0  4268280    796 s002  S+    1:43PM   0:00.00 /usr/local/Cellar/findutils/4.7.0/libexec/gfrcode -0
root             84861   0.0  0.0  4282172    708 s002  S+    1:43PM   0:00.00 /bin/sh /usr/local/Cellar/findutils/4.7.0/libexec/bin/gupdatedb
root             84853   0.0  0.0  4273980   1164 s002  S+    1:43PM   0:00.01 /bin/sh /usr/local/Cellar/findutils/4.7.0/libexec/bin/gupdatedb
root             84850   0.0  0.0  5396228  10288 s008  S+    1:43PM   0:00.27 vim /usr/local/Cellar/findutils/4.7.0/libexec/bin/gupdatedb
root             84849   0.0  0.0  4788896   6740 s008  S+    1:43PM   0:00.03 sudo vim /usr/local/Cellar/findutils/4.7.0/libexec/bin/gupdatedb

Finally, the database does not seem to get built: I am checking the sizes of /usr/local/var/locate/locatedb.n and /usr/local/var/locate/locatedb, but nothing changes.

What is wrong with the syntax I used with parallel? (In particular, I don't know how to handle the ... ::: options part of the command.)

PS: I have set the following in gupdatedb:

# Directories to not put in the database, which would otherwise be.
: ${PRUNEPATHS="
/afs
/amd
/proc
/sfs
/tmp
/usr/tmp
/var/tmp
/Volumes
"}

and

# You can set these in the environment, or use command-line options,
# to override their defaults:

# Any global options for find?
: ${FINDOPTIONS=}

# What shell shoud we use?  We should use a POSIX-ish sh.
: ${SHELL="/bin/sh"}

# Non-network directories to put in the database.
: ${SEARCHPATHS="/"}

Update 1

To be more precise, here is a post where I ask about a potential optimization (parallelization) using the parallel/find pair:

example of a potential parallelization with coupled parallel/find

I would like to do the same optimization, but for the gupdatedb script.

Update 2

I followed the advice of:

The default command in gupdatedb concerning my issue is:

$find $SEARCHPATHS $FINDOPTIONS \
 \( $prunefs_exp \
 -type d -regex "$PRUNEREGEX" \) -prune -o $print_option

So I modified it like this:

parallel -j32 $find {} $SEARCHPATHS $FINDOPTIONS \
    \( $prunefs_exp \
    -type d -regex "$PRUNEREGEX" \) -prune -o $print_option ::: /

and I get the following error:

/bin/sh: -c: line 0: syntax error near unexpected token `('
/bin/sh: -c: line 0: `/usr/local/Cellar/findutils/4.7.0/bin/gfind / / ( -fstype 9P -o -fstype NFS -o -fstype afs -o -fstype autofs -o -fstype cifs -o -fstype coda -o -fstype devfs -o -fstype devpts -o -fstype ftpfs -o -fstype iso9660 -o -fstype mfs -o -fstype ncpfs -o -fstype nfs -o -fstype nfs4 -o -fstype proc -o -fstype shfs -o -fstype smbfs -o -fstype sysfs -o -type d -regex \(^/private/tmp$\)\|\(^/private/var/folders$\)\|\(^/private/var/tmp$\)\|\(^*/Backups.backupdb$\)\|\(^/System$\)\|\(^/Volumes$\) ) -prune -o -print0'

What might be wrong here?

Update 3

Here is the gupdatedb script, where you can see my different attempts from line 300 onward:

#! /bin/sh
# updatedb -- build a locate pathname database
# Copyright (C) 1994-2019 Free Software Foundation, Inc.
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program.  If not, see <https://www.gnu.org/licenses/>.

# csh original by James Woods; sh conversion by David MacKenzie.

#exec 2> /tmp/updatedb-trace.txt
#set -x

version='
updatedb (GNU findutils) 4.7.0
Copyright (C) 1994-2019 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <https://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Written by Eric B. Decker, James Youngman, and Kevin Dalley.
'

# File path names are not actually text, anyway (since there is no
# mechanism to enforce any constraint that the basename of a
# subdirectory has the same character encoding as the basename of its
# parent).  The practical effect is that, depending on the way a
# particular system is configured and the content of its filesystem,
# passing all the file names in the system through "sort" may generate
# character encoding errors in text-based tools like "sort".  To avoid
# this, we set LC_ALL=C.  This will, presumably, not work perfectly on
# systems where LC_ALL is not the way to do locale configuration or
# some other seting can override this.
LC_ALL=C
export LC_ALL

# We can't use substitution on PACKAGE_URL below because it
# (correctly) points to https://www.gnu.org/software/findutils/ instead
# of the bug reporting page.
usage="\
Usage: $0 [--findoptions='-option1 -option2...']
       [--localpaths='dir1 dir2...'] [--netpaths='dir1 dir2...']
       [--prunepaths='dir1 dir2...'] [--prunefs='fs1 fs2...']
       [--output=dbfile] [--netuser=user] [--localuser=user]
       [--dbformat] [--version] [--help]

Please see also the documentation at http://www.gnu.org/software/findutils/.
Report (and track progress on fixing) bugs in the updatedb
program via the GNU findutils bug-reporting page at
https://savannah.gnu.org/bugs/?group=findutils or, if
you have no web access, by sending email to <bug-findutils@gnu.org>.
"
changeto=/

for arg
do
  # If we are unable to fork, the back-tick operator will
  # fail (and the shell will emit an error message).  When
  # this happens, we exit with error value 71 (EX_OSERR).
  # Alternative candidate - 75, EX_TEMPFAIL.
  opt=`echo $arg|sed 's/^\([^=]*\).*/\1/'`  || exit 71
  val=`echo $arg|sed 's/^[^=]*=\(.*\)/\1/'` || exit 71
  case "$opt" in
    --findoptions) FINDOPTIONS="$val" ;;
    --localpaths) SEARCHPATHS="$val" ;;
    --netpaths) NETPATHS="$val" ;;
    --prunepaths) PRUNEPATHS="$val" ;;
    --prunefs) PRUNEFS="$val" ;;
    --output) LOCATE_DB="$val" ;;
    --netuser) NETUSER="$val" ;;
    --localuser) LOCALUSER="$val" ;;
    --changecwd)  changeto="$val" ;;
    --dbformat)   dbformat="$val" ;;
    --version) fail=0; echo "$version" || fail=1; exit $fail ;;
    --help)    fail=0; echo "$usage"   || fail=1; exit $fail ;;
    *) echo "updatedb: invalid option $opt
Try '$0 --help' for more information." >&2
       exit 1 ;;
  esac
done

frcode_options=""
case "$dbformat" in
    "")
        # Default, use LOCATE02
        ;;
    LOCATE02)
        ;;
    slocate)
        frcode_options="$frcode_options -S 1"
        ;;
    *)
        # The "old" database format is no longer supported.
        echo "Unsupported locate database format ${dbformat}: Supported formats are:" >&2
        echo "LOCATE02, slocate" >&2
        exit 1
esac


if true
then
    sort="/usr/bin/sort -z"
    print_option="-print0"
    frcode_options="$frcode_options -0"
else
    sort="/usr/bin/sort"
    print_option="-print"
fi

getuid() {
    # format of "id" output is ...
    # uid=1(daemon) gid=1(other)
    # for `id's that don't understand -u
    id | cut -d'(' -f 1 | cut -d'=' -f2
}

# figure out if su supports the -s option
select_shell() {
    if su "$1" -s $SHELL -c false < /dev/null  ; then
    # No.
    echo ""
    else
    if su "$1" -s $SHELL -c true < /dev/null  ; then
        # Yes.
        echo "-s $SHELL"
        else
        # su is unconditionally failing.  We won't be able to
        # figure out what is wrong, so be conservative.
        echo ""
    fi
    fi
}


# You can set these in the environment, or use command-line options,
# to override their defaults:

# Any global options for find?
: ${FINDOPTIONS="-mindepth 1 -maxdepth 1"}
#: ${FINDOPTIONS=""}

# What shell shoud we use?  We should use a POSIX-ish sh.
: ${SHELL="/bin/sh"}

# Non-network directories to put in the database.
: ${SEARCHPATHS="/"}

# Network (NFS, AFS, RFS, etc.) directories to put in the database.
: ${NETPATHS=}

# Directories to not put in the database, which would otherwise be.
: ${PRUNEPATHS="
/afs
/amd
/proc
/sfs
/tmp
/usr/tmp
/var/tmp
"}

# Trailing slashes result in regex items that are never matched, which
# is not what the user will expect.   Therefore we now reject such
# constructs.
for p in $PRUNEPATHS; do
    case "$p" in
    /*/)   echo "$0: $p: pruned paths should not contain trailing slashes" >&2
           exit 1
    esac
done

# The same, in the form of a regex that find can use.
test -z "$PRUNEREGEX" &&
  PRUNEREGEX=`echo $PRUNEPATHS|sed -e 's,^,\\\(^,' -e 's, ,$\\\)\\\|\\\(^,g' -e 's,$,$\\\),'`

# The database file to build.
: ${LOCATE_DB=/usr/local/var/locate/locatedb}

# Directory to hold intermediate files.
if test -z "$TMPDIR"; then
  if test -d /var/tmp; then
    : ${TMPDIR=/var/tmp}
  elif test -d /usr/tmp; then
    : ${TMPDIR=/usr/tmp}
  else
    : ${TMPDIR=/tmp}
  fi
fi
export TMPDIR

# The user to search network directories as.
: ${NETUSER=daemon}

# The directory containing the subprograms.
if test -n "$LIBEXECDIR" ; then
    : LIBEXECDIR already set, do nothing
else
    : ${LIBEXECDIR=/usr/local/Cellar/findutils/4.7.0/libexec}
fi

# The directory containing find.
if test -n "$BINDIR" ; then
    : BINDIR already set, do nothing
else
    : ${BINDIR=/usr/local/Cellar/findutils/4.7.0/bin}
fi

# The names of the utilities to run to build the database.
: ${find:=${BINDIR}/gfind}
: ${frcode:=${LIBEXECDIR}/gfrcode}

make_tempdir () {
    # This implementation is adapted from the GNU Autoconf manual.
    {
        tmp=`
    (umask 077 && mktemp -d "$TMPDIR/updatedbXXXXXX") 2>/dev/null
    ` &&
        test -n "$tmp" && test -d "$tmp"
    } || {
    # This method is less secure than mktemp -d, but it's a fallback.
    #
    # We use $$ as well as $RANDOM since $RANDOM may not be available.
    # We also add a time-dependent suffix.  This is actually somewhat
    # predictable, but then so is $$.  POSIX does not require date to
    # support +%N.
    ts=`date +%N%S || date +%S 2>/dev/null`
        tmp="$TMPDIR"/updatedb"$$"-"${RANDOM:-}${ts}"
        (umask 077 && mkdir "$tmp")
    }
    echo "$tmp"
}

checkbinary () {
    if test -x "$1" ; then
    : ok
    else
      eval echo "updatedb needs to be able to execute $1, but cannot." >&2
      exit 1
    fi
}

for binary in $find $frcode
do
  checkbinary $binary
done


: ${PRUNEFS="
9P
NFS
afs
autofs
cifs
coda
devfs
devpts
ftpfs
iso9660
mfs
ncpfs
nfs
nfs4
proc
shfs
smbfs
sysfs
"}

if test -n "$PRUNEFS"; then
prunefs_exp=`echo $PRUNEFS |sed -e 's/\([^ ][^ ]*\)/-o -fstype \1/g' \
 -e 's/-o //' -e 's/$/ -o/'`
else
  prunefs_exp=''
fi

# Make and code the file list.
# Sort case insensitively for users' convenience.

rm -f $LOCATE_DB.n
trap 'rm -f $LOCATE_DB.n; exit' HUP TERM

if {
cd "$changeto"
if test -n "$SEARCHPATHS"; then
  if [ "$LOCALUSER" != "" ]; then
    # : A1
    su $LOCALUSER `select_shell $LOCALUSER` -c \
    "$find $SEARCHPATHS $FINDOPTIONS \
     \\( $prunefs_exp \
     -type d -regex '$PRUNEREGEX' \\) -prune -o $print_option"
  else
    # : A2
    # ORIGINAL VERSION : sequential find
    #$find $SEARCHPATHS $FINDOPTIONS \
    # \( $prunefs_exp \
    # -type d -regex "$PRUNEREGEX" \) -prune -o $print_option ::: /

    # Parallel version 1
    #parallel -j 32 $find $SEARCHPATHS $FINDOPTIONS \
    # \( $prunefs_exp \
    # -type d -regex "$PRUNEREGEX" \) -prune -o $print_option ::: /
    
    # Parallel version 2
    parallel -j 32 $find {} $FINDOPTIONS \
    $prunefs_exp -type d -regex $PRUNEREGEX -prune -o $print_option ::: */*
  fi
fi

if test -n "$NETPATHS"; then
myuid=`getuid`
if [ "$myuid" = 0 ]; then
    # : A3
    su $NETUSER `select_shell $NETUSER` -c \
     "$find $NETPATHS $FINDOPTIONS \\( -type d -regex '$PRUNEREGEX' -prune \\) -o $print_option" ||
    exit $?
  else
    # : A4
    $find $NETPATHS $FINDOPTIONS \( -type d -regex "$PRUNEREGEX" -prune \) -o $print_option ||
    exit $?
  fi
fi
} | $sort | $frcode $frcode_options > $LOCATE_DB.n
then
    : OK so far
    true
else
    rv=$?
    echo "Failed to generate $LOCATE_DB.n" >&2
    rm -f $LOCATE_DB.n
    exit $rv
fi

# To avoid breaking locate while this script is running, put the
# results in a temp file, then rename it atomically.
if test -s $LOCATE_DB.n; then
  chmod 644 ${LOCATE_DB}.n
  mv ${LOCATE_DB}.n $LOCATE_DB
else
  echo "updatedb: new database would be empty" >&2
  rm -f $LOCATE_DB.n
fi

exit 0

I launch the gupdatedb command like this:

sudo gupdatedb --prunepaths='/private/tmp /private/var/folders /private/var/tmp */Backups.backupdb /System /Volumes' --localpaths='/' --output=$HOME/locatedb_gupdatedb_PARALLEL

Update 4

My bounty expires tomorrow. With the default gupdatedb, the whole indexing run takes about 30 minutes. If I could manage to use parallel correctly in the core of the gupdatedb script, i.e. where it indexes with the gfind command, what speedup factor can I expect?

And one last request: how can I fix this error:

/bin/sh: -c: line 0: syntax error near unexpected token `('
/bin/sh: -c: line 0: `/usr/local/Cellar/findutils/4.7.0/bin/gfind / / ( -fstype 9P -o -fstype NFS -o -fstype afs -o -fstype autofs -o -fstype cifs -o -fstype coda -o -fstype devfs -o -fstype devpts -o -fstype ftpfs -o -fstype iso9660 -o -fstype mfs -o -fstype ncpfs -o -fstype nfs -o -fstype nfs4 -o -fstype proc -o -fstype shfs -o -fstype smbfs -o -fstype sysfs -o -type d -regex \(^/private/tmp$\)\|\(^/private/var/folders$\)\|\(^/private/var/tmp$\)\|\(^*/Backups.backupdb$\)\|\(^/System$\)\|\(^/Volumes$\) ) -prune -o -print0'

which I get with the command:

parallel -j32 $find {} $FINDOPTIONS \
    \( $prunefs_exp \
    -type d -regex "$PRUNEREGEX" \) -prune -o $print_option ::: /

?

Answer 1

You don't need ::: if there's nothing after it, and {} is pointless too if you don't have any input sources. Without more information about what exactly you want to parallelize, we can't really tell you what you should use instead.

But for example, if you want to run one find in each of /etc, /usr, /bin, and /opt, that would look like

parallel find {} -options ::: /etc /usr /bin /opt

This could equivalently be expressed without ::: as follows:

printf '%s\n' /etc /usr /bin /opt |
parallel find {} -options

So the purpose of ::: is basically to say "I want to specify the things to parallelize over on the command line instead of receiving them on standard input"; but if you don't provide this information either way, parallel doesn't know what to replace {} with.

I'm not saying this particular use makes sense for your use case; I'm just hoping to clarify the documentation (again).
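
As for the syntax error in Update 2 and Update 4: GNU parallel builds a command line from its arguments and hands it to a shell (the /bin/sh -c in the error message), so the command is parsed twice. The backslashes in \( and \) are consumed by the shell that starts parallel, and the inner shell then sees bare parentheses. The following is a minimal sketch of that mechanism using plain sh -c and echo as stand-ins for parallel and gfind (both substitutions are assumptions made to keep the example self-contained):

```shell
#!/bin/sh
# After the calling shell strips the backslashes, the command string that
# reaches the inner shell contains bare "(" and ")" words. Re-parsed by
# sh -c they are shell syntax again, as in the error from Update 2.
cmd_unquoted='echo ( -type d ) -prune'
if sh -c "$cmd_unquoted" >/dev/null 2>&1; then
    unquoted_status=ok
else
    unquoted_status=syntax-error
fi

# Quoting the parentheses a second time lets them survive the inner shell.
cmd_quoted='echo \( -type d \) -prune'
quoted_output=$(sh -c "$cmd_quoted")

echo "unquoted: $unquoted_status"
echo "quoted:   $quoted_output"
```

With parallel itself, the same effect should be obtainable either by passing the parentheses already escaped for the inner shell (for example '\(' and '\)' in single quotes) or by using its -q (--quote) option, which is documented for commands containing shell-special characters. Note that the value of $PRUNEREGEX also contains backslashes and $ signs, so it needs the same protection; this has not been checked against the exact script in the question.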

Answer 2

To get any meaningful speedup from using parallel, you need to make sure that you have the resources to make the process faster. There are two challenges here:

The updatedb process is I/O bound. Usually, you use parallel to take advantage of a multi-core system and spread a CPU-bound process over multiple cores.
The updatedb process requires exclusive access to the database (usually /var/lib/mlocate/mlocate.db). Even if you get some benefit from splitting updatedb over multiple cores, you will have to place the output into multiple databases. This approach requires passing all the database names to locate with '-d', separated by ':'.

Unless your system has multiple disk drives (or you are accessing network drives), you will gain very little from running find in parallel.

If your system has multiple disk drives (and/or network drives), you can index each file system in parallel, using a script like the one below.

Assuming you have 2 additional disks mounted on /mnt/disk1 and /mnt/disk2:

# Index root, excluding the two extra disks
updatedb --output=/var/lib/mlocate/local.db -e '/mnt/disk1 /mnt/disk2' &
# Index 1st extra disk (or network drive)
updatedb --output=/var/lib/mlocate/disk1.db -U /mnt/disk1 &
# Index 2nd extra disk (or network drive)
updatedb --output=/var/lib/mlocate/disk2.db -U /mnt/disk2 &
# Wait for all three background jobs to finish
wait

You should then set the environment variable LOCATE_PATH to point to all the databases:

export LOCATE_PATH=/var/lib/mlocate/local.db:/var/lib/mlocate/disk1.db:/var/lib/mlocate/disk2.db
locate ...
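
The script above uses the mlocate flavour of updatedb. For the GNU findutils gupdatedb from the question, roughly the same layout could be expressed with the options listed in its usage text (--localpaths, --prunepaths, --output). The following is a minimal sketch under stated assumptions: the mount points /mnt/disk1 and /mnt/disk2 are taken from the example above, the database directory /tmp/locate-dbs and the run helper are hypothetical, and by default the script only prints the commands instead of executing them.

```shell
#!/bin/sh
# Sketch only: /mnt/disk1, /mnt/disk2 and /tmp/locate-dbs are hypothetical.
db_dir=${DB_DIR:-/tmp/locate-dbs}
disks="/mnt/disk1 /mnt/disk2"

# Print each command by default; set RUN=1 to execute it for real.
run() {
    if [ "${RUN:-0}" = 1 ]; then
        "$@"
    else
        printf '%s\n' "$*"
    fi
}

run mkdir -p "$db_dir"

# Root file system, pruning the extra disks so they are not indexed twice.
# Note: --prunepaths replaces gupdatedb's default PRUNEPATHS list.
run gupdatedb --localpaths=/ --prunepaths="$disks" --output="$db_dir/root.db" &
LOCATE_PATH="$db_dir/root.db"

# One background gupdatedb per extra disk, each writing its own database.
for d in $disks; do
    db="$db_dir/$(basename "$d").db"
    run gupdatedb --localpaths="$d" --output="$db" &
    LOCATE_PATH="$LOCATE_PATH:$db"
done
wait

# locate searches every database in the colon-separated list.
export LOCATE_PATH
echo "LOCATE_PATH=$LOCATE_PATH"
```

Each gupdatedb instance writes its own database, so the instances never compete for the same output file; whether the run gets faster still depends on the disks being physically separate, as noted above.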
