

Optimize shell script for multiple sed replacements

I have a file containing a list of replacement pairs (about 100 of them) which are used by sed to replace strings in files.

The pairs look like:

old|new
tobereplaced|replacement
(stuffiwant).*(too)|\1\2

and my current code is:

cat replacement_list | while read i
do
    old=$(echo "$i" | awk -F'|' '{print $1}')    #due to the need for extended regex
    new=$(echo "$i" | awk -F'|' '{print $2}')
    sed -r "s/`echo "$old"`/`echo "$new"`/g" -i file
done

I can't help but think that there is a more optimal way of performing the replacements. I tried turning the loop around to run through the lines of the file first, but that turned out to be much more expensive.

Are there any other ways of speeding up this script?

EDIT

Thanks for all the quick responses. Let me try out the various suggestions before choosing an answer.

One thing to clear up: I also need subexpression/group functionality. For example, one replacement I might need is:

([0-9])U|\10  #the extra brackets and escapes were required for my original code

Some details on the improvements (to be updated):

  • Method: processing time
  • Original script: 0.85s
  • cut instead of awk: 0.71s
  • anubhava's method: 0.18s
  • chthonicdaemon's method: 0.01s

You can use sed to produce correctly-formatted sed input:

sed -e 's/^/s|/; s/$/|g/' replacement_list | sed -r -f - file
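For instance, the first stage turns each `old|new` line into an `s|old|new|g` command. A quick way to inspect the generated script, using the pair names from the question's example:

```shell
# Show the sed script generated by the first stage
# (pairs taken from the question's example list)
printf '%s\n' 'old|new' 'tobereplaced|replacement' \
    | sed -e 's/^/s|/; s/$/|g/'
# prints:
#   s|old|new|g
#   s|tobereplaced|replacement|g
```

The second sed then reads that script from stdin via `-f -` and applies all rules in a single pass over the file.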

I recently benchmarked various string replacement methods, among them a custom program, sed -e, perl -lnpe, and a probably not that widely known MySQL command line utility, replace. replace, being optimized for string replacements, was almost an order of magnitude faster than sed. The results looked something like this (slowest first):

custom program > sed > LANG=C sed > perl > LANG=C perl > replace

If you want performance, use replace. To have it available on your system, you'll need to install some MySQL distribution, though.

From replace.c:

Replace strings in textfile

This program replaces strings in files or from stdin to stdout. It accepts a list of from-string/to-string pairs and replaces each occurrence of a from-string with the corresponding to-string. The first occurrence of a found string is matched. If there is more than one possibility for the string to replace, longer matches are preferred before shorter matches.

...

The programs make a DFA state machine of the strings, and the speed isn't dependent on the count of replace-strings (only on the number of replaces). A line is assumed to end with \n or \0. There is no limit except memory on the length of strings.


More on sed. You can utilize multiple cores with sed by splitting your replacements into #cpus groups and then piping them through sed commands, something like this:

$ sed -e 's/A/B/g; ...' file.txt | \
  sed -e 's/B/C/g; ...' | \
  sed -e 's/C/D/g; ...' | \
  sed -e 's/D/E/g; ...' > out
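A minimal two-stage sketch of this idea (the rules here are hypothetical; the point is that each sed is its own process, so the stages can run on separate cores):

```shell
# Two-stage sed pipeline: each stage applies its own share of the rules
printf 'aaa bbb\n' \
    | sed -e 's/aaa/AAA/g' \
    | sed -e 's/bbb/BBB/g'
# prints: AAA BBB
```

Note the stages run concurrently only while data is flowing; for a single short file the pipeline cost can outweigh the gain.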

Also, if you use sed or perl and your system has a UTF-8 setup, then it also boosts performance to place LANG=C in front of the commands:

$ LANG=C sed ...

You can cut down on unnecessary awk invocations and use BASH to split the name-value pairs:

while IFS='|' read -r old new; do
   # echo "$old :: $new"
   sed -i "s~$old~$new~g" file
done < replacement_list

IFS='|' enables read to populate the name and value into two different shell variables, old and new.

This is assuming ~ is not present in your name-value pairs. If that is not the case then feel free to use an alternate sed delimiter.
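A quick check of how the IFS='|' read splits a pair (piped into a compound command so it also works in plain sh, where `<<<` is unavailable):

```shell
# IFS='|' makes read split the line at '|' into $old and $new
printf 'tobereplaced|replacement\n' | {
    IFS='|' read -r old new
    printf '%s -> %s\n' "$old" "$new"
}
# prints: tobereplaced -> replacement
```

Setting IFS only on the read command leaves the shell's global IFS untouched.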

Here is what I would try:

  1. store your sed search-replace pairs in a Bash array;
  2. build the sed command based on this array using parameter expansion;
  3. run the command.
patterns=(
  old new
  tobereplaced replacement
)
pattern_count=${#patterns[*]} # number of entries in the array
sedArgs=() # will hold the list of sed arguments

for (( i=0 ; i<pattern_count ; i=i+2 )); do # step by 2: search at i, replacement at i+1
  search=${patterns[i]}
  replace=${patterns[i+1]}
  sedArgs+=(-e "s/$search/$replace/g")      # append as array elements, not a string
done
sed "${sedArgs[@]}" file

This results in this command:

sed -e s/old/new/g -e s/tobereplaced/replacement/g file

You can try this.

pattern=''
while read -r i
do
    old=$(echo "$i" | awk -F'|' '{print $1}')    # due to the need for extended regex
    new=$(echo "$i" | awk -F'|' '{print $2}')
    pattern=${pattern}"s/${old}/${new}/g;"
done < replacement_list    # redirect instead of cat|while, so $pattern survives the loop
sed -r "$pattern" -i file

This will run the sed command only once on the file with all the replacements. You may also want to replace awk with cut. cut may be more optimized than awk, though I am not sure about that.

old=`echo $i | cut -d"|" -f1`
new=`echo $i | cut -d"|" -f2`

You might want to do the whole thing in awk:

awk -F\| 'NR==FNR{old[++n]=$1;new[n]=$2;next}{for(i=1;i<=n;++i)gsub(old[i],new[i])}1' replacement_list file

Build up a list of old and new words from the first file. The next ensures that the rest of the script isn't run on the first file. For the second file, loop through the list of replacements and perform them one by one. The 1 at the end means that the line is printed.
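A self-contained run of this one-liner on the question's sample pairs (the input files are created in a temporary directory just for the demo):

```shell
# Demo of the two-file awk approach using the question's sample pairs
tmp=$(mktemp -d)
printf 'old|new\ntobereplaced|replacement\n' > "$tmp/replacement_list"
printf 'old text, tobereplaced here\n' > "$tmp/file"
awk -F\| 'NR==FNR{old[++n]=$1;new[n]=$2;next}
          {for(i=1;i<=n;++i)gsub(old[i],new[i])}1' \
    "$tmp/replacement_list" "$tmp/file"
# prints: new text, replacement here
rm -r "$tmp"
```

Since gsub treats the search strings as regular expressions, patterns with groups such as ([0-9])U work too, though awk uses & rather than \1 in the replacement text.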

{ cat replacement_list;echo "-End-"; cat YourFile; } | sed -n '1,/-End-/ s/$/³/;1h;1!H;$ {g
t again
:again
   /^-End-³\n/ {s///;b done
      }
   s/^\([^|]*\)|\([^³]*\)³\(\n\)\(.*\)\1/\1|\2³\3\4\2/
   t again
   s/^[^³]*³\n//
   t again
:done
  p
  }'

This one is more for the fun of coding via sed. It may still be worth timing, since it starts only a single sed that works recursively.

For POSIX sed (so use --posix with GNU sed).

Explanation:

  • copy the replacement list in front of the file content, with delimiters (³ for lines and -End- for the list) for easier sed handling (it is hard to use \n in a character class in POSIX sed)
  • place all lines in the buffer (adding the line delimiter for the replacement list, and -End- before it)
  • if the line is -End-³, remove it and go to the final print
  • replace each occurrence of the first pattern (group 1) found in the text with the second pattern (group 2)
  • if something was found, restart (t again)
  • remove the first line
  • restart the process (t again). t is needed because b does not reset the test, so the next t would always be true.

Thanks to @miku above;

I have a 100MB file with a list of 80k replacement-strings.

I tried various combinations of seds, sequential or parallel, but didn't see throughputs getting shorter than about a 20-hour runtime.

Instead I put my list into a sequence of scripts like "cat in | replace aold anew bold bnew cold cnew ... > out ; rm in ; mv out in".

I randomly picked 1000 replacements per file, so it all went like this:

# first, split my replace-list into manageable chunks (89 files in this case)
split -a 4 -l 1000 80kReplacePairs rep_

# next, make a 'replace' script out of each chunk
for F in rep_* ; do \
    echo "create and make executable a scriptfile" ; \
    echo '#!/bin/sh' > run_$F.sh ; chmod +x run_$F.sh ; \
    echo "for each chunk-file line, strip line-ends," ; \
    echo "then with sed, turn '{long list}' into 'cat in | {long list}' > out" ; \
    cat $F | tr '\n' ' ' | sed 's/^/cat in | replace /;s/$/ > out/' >> run_$F.sh ;
    echo "and append commands to switch in and out files, for next script" ; \
    echo -e " && \\\\ \nrm in && mv out in\n" >> run_$F.sh ; \
done

# put all the replace-scripts in sequence into a main script
ls ./run_rep_aa* > allrun.sh

# make it executable
chmod +x allrun.sh 

# run it
nohup ./allrun.sh &

.. which ran in under 5 mins, a lot less than 20 hours!

Looking back, I could have used more pairs per script, by finding how many lines would make up the limit.

xargs --show-limits </dev/null 2>&1 | grep --color=always "actually use:"
    Maximum length of command we could actually use: 2090490

So just under 2MB; how many pairs would that be for my script?

head -c 2090490 80kReplacePairs | wc -l

    76923

So it seems I could have used 2 * 40000-line chunks.
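A rough sketch of the arithmetic behind that estimate, using only the measured numbers above (the ~27-byte average pair length is derived, not measured directly):

```shell
# 76923 pairs fit in the 2090490-byte command-line limit,
# so each pair line averages about 27 bytes
echo $(( 2090490 / 76923 ))   # prints: 27

# a 40000-line chunk at ~27 bytes/line stays well under the limit
echo $(( 40000 * 27 ))        # prints: 1080000
```

Two 40000-line chunks therefore cover all 80k pairs while each command line stays comfortably below the 2090490-byte ceiling.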

To expand on chthonicdaemon's solution:

generate-regex.sh

<replacement_list perl -p -0 -e '
  s/\//\\\//g;
  s/([^\n]+)\n([^\n]+)(?:\n([^\n]+)(?:\n([^\n]+))?)?/s\/\1\/\2\/\3;/g
'
