Optimize shell script for multiple sed replacements
I have a file containing a list of replacement pairs (about 100 of them) which are used by sed to replace strings in files.
The pairs go like:
old|new
tobereplaced|replacement
(stuffiwant).*(too)|\1\2
and my current code is:
cat replacement_list | while read i
do
old=$(echo "$i" | awk -F'|' '{print $1}') #due to the need for extended regex
new=$(echo "$i" | awk -F'|' '{print $2}')
sed -r "s/`echo "$old"`/`echo "$new"`/g" -i file
done
I cannot help but think that there is a more optimal way of performing the replacements. I tried turning the loop around to run through lines of the file first, but that turned out to be much more expensive.
Are there any other ways of speeding up this script?
EDIT
Thanks for all the quick responses. Let me try out the various suggestions before choosing an answer.
One thing to clear up: I also need subexpressions/groups functionality. For example, one replacement I might need is:
([0-9])U|\10 #the extra brackets and escapes were required for my original code
Some details on the improvements (to be updated):
cut instead of awk: 0.71s

You can use sed to produce correctly formatted sed input:
sed -e 's/^/s|/; s/$/|g/' replacement_list | sed -r -f - file
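A quick end-to-end check of that pipeline, using a throwaway replacement_list and file (the sample contents here are just for the demo):

```shell
# Build a tiny replacement list and input file for the demo.
printf 'old|new\ntobereplaced|replacement\n' > replacement_list
printf 'old text, tobereplaced here\n' > file
# Stage 1 wraps each pair into an s|...|g command;
# stage 2 reads that generated script from stdin (-f -) and applies it.
sed -e 's/^/s|/; s/$/|g/' replacement_list | sed -r -f - file
# prints: new text, replacement here
```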
I recently benchmarked various string replacement methods, among them a custom program, sed -e, perl -lnpe and a probably not that widely known MySQL command line utility, replace. replace, being optimized for string replacements, was almost an order of magnitude faster than sed. The results looked something like this (slowest first):
custom program > sed > LANG=C sed > perl > LANG=C perl > replace
If you want performance, use replace. To have it available on your system, you'll need to install some MySQL distribution, though.
Replace strings in textfile
This program replaces strings in files or from stdin to stdout. It accepts a list of from-string/to-string pairs and replaces each occurrence of a from-string with the corresponding to-string. The first occurrence of a found string is matched. If there is more than one possibility for the string to replace, longer matches are preferred before shorter matches.
...
The programs make a DFA state machine of the strings, and the speed isn't dependent on the count of replace-strings (only on the number of replaces). A line is assumed to end with \n or \0. There is no limit except memory on the length of strings.
More on sed. You can utilize multiple cores with sed, by splitting your replacements into #cpus groups and then piping them through sed commands, something like this:
$ sed -e 's/A/B/g; ...' file.txt | \
sed -e 's/B/C/g; ...' | \
sed -e 's/C/D/g; ...' | \
sed -e 's/D/E/g; ...' > out
Also, if you use sed or perl and your system has a UTF-8 setup, then it also boosts performance to place a LANG=C in front of the commands:
$ LANG=C sed ...
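The effect is easy to try; the speedup only shows up on large files, but the invocation looks like this (toy input, assuming GNU sed):

```shell
# LANG=C applies only to this one command; sed then compares bytes
# instead of multi-byte characters, which is usually faster on UTF-8 systems.
printf 'abc\n' | LANG=C sed 's/b/X/'
# prints: aXc
```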
You can cut down unnecessary awk invocations and use BASH to break up the name-value pairs:
while IFS='|' read -r old new; do
# echo "$old :: $new"
sed -i "s~$old~$new~g" file
done < replacement_list
IFS='|' will enable read to populate the name-value pair into two different shell variables, old and new.
This is assuming ~ is not present in your name-value pairs. If that is not the case, then feel free to use an alternate sed delimiter.
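To see what that IFS='|' read does with one pair (the values here are hypothetical):

```shell
# read splits the line on | because IFS is set to | for this command only.
IFS='|' read -r old new <<< 'tobereplaced|replacement'
echo "$old"   # tobereplaced
echo "$new"   # replacement
```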
Here is what I would try:
Store the sed search-replace pairs in a Bash array like:
patterns=(
old new
tobereplaced replacement
)
pattern_count=${#patterns[*]} # number of patterns
sedArgs=() # will hold the list of sed arguments
for (( i=0 ; i<$pattern_count ; i=i+2 )); do # don't need to loop on the replacement…
search=${patterns[i]};
replace=${patterns[i+1]}; # … here we got the replacement part
sedArgs+=(-e "s/$search/$replace/g")
done
sed "${sedArgs[@]}" file
This results in this command:
sed -e s/old/new/g -e s/tobereplaced/replacement/g file
You can try this.
pattern=''
while read -r i
do
old=$(echo "$i" | awk -F'|' '{print $1}') #due to the need for extended regex
new=$(echo "$i" | awk -F'|' '{print $2}')
pattern=${pattern}"s/${old}/${new}/g;"
done < replacement_list   # redirect instead of cat|while, so $pattern survives the loop
sed -r "$pattern" -i file
This will run the sed command only once on the file with all the replacements. You may also want to replace awk with cut. cut may be more optimized than awk, though I am not sure about that.
old=`echo $i | cut -d"|" -f1`
new=`echo $i | cut -d"|" -f2`
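As a further micro-optimization (an aside, not from the answer above): Bash parameter expansion can split the pair without spawning cut or awk at all:

```shell
i='tobereplaced|replacement'   # hypothetical pair read from the list
old=${i%%|*}   # strip the longest '|*' suffix: everything before the first |
new=${i#*|}    # strip the shortest '*|' prefix: everything after the first |
echo "$old -> $new"
# prints: tobereplaced -> replacement
```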
You might want to do the whole thing in awk:
awk -F\| 'NR==FNR{old[++n]=$1;new[n]=$2;next}{for(i=1;i<=n;++i)gsub(old[i],new[i])}1' replacement_list file
Build up a list of old and new words from the first file. The next ensures that the rest of the script isn't run on the first file. For the second file, loop through the list of replacements and perform them one by one. The 1 at the end means that the line is printed.
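A quick run of the one-liner on sample data (file names as in the question):

```shell
# Sample pair list and input file for the demo.
printf 'old|new\ntobereplaced|replacement\n' > replacement_list
printf 'old and tobereplaced\n' > file
# First pass (NR==FNR) loads the pairs; second pass applies each gsub in order.
awk -F\| 'NR==FNR{old[++n]=$1;new[n]=$2;next}{for(i=1;i<=n;++i)gsub(old[i],new[i])}1' replacement_list file
# prints: new and replacement
```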
{ cat replacement_list;echo "-End-"; cat YourFile; } | sed -n '1,/-End-/ s/$/³/;1h;1!H;$ {g
t again
:again
/^-End-³\n/ {s///;b done
}
s/^\([^|]*\)|\([^³]*\)³\(\n\)\(.*\)\1/\1|\2³\3\4\2/
t again
s/^[^³]*³\n//
t again
:done
p
}'
More for fun, coding it via sed. Maybe try it for time performance, because this starts only 1 sed, which is recursive.
for POSIX sed (so --posix with GNU sed)
Explanation:
A ³ is appended to each line of the list (and the list is terminated with -End-) for easier sed handling (it is hard to use \n in a character class in POSIX sed).
When the content starts with -End-³, remove the line and go to the final print.
After each successful substitution, loop back (t again).
When a replacement pair is exhausted, remove it and loop again (t again).
t is needed because b does not reset the test, and the next t is always true.
Thanks to @miku above;
I have a 100MB file with a list of 80k replacement-strings.
I tried various combinations of seds, sequential or parallel, but didn't see throughputs getting shorter than about a 20-hour runtime.
Instead I put my list into a sequence of scripts like "cat in | replace aold anew bold bnew cold cnew ... > out ; rm in ; mv out in".
I randomly picked 1000 replacements per file, so it all went like this:
# first, split my replace-list into manageable chunks (89 files in this case)
split -a 4 -l 1000 80kReplacePairs rep_
# next, make a 'replace' script out of each chunk
for F in rep_* ; do \
echo "create and make executable a scriptfile" ; \
echo '#!/bin/sh' > run_$F.sh ; chmod +x run_$F.sh ; \
echo "for each chunk-file line, strip line-ends," ; \
echo "then with sed, turn '{long list}' into 'cat in | {long list}' > out" ; \
cat $F | tr '\n' ' ' | sed 's/^/cat in | replace /;s/$/ > out/' >> run_$F.sh ;
echo "and append commands to switch in and out files, for next script" ; \
echo -e " && \\\\ \nrm in && mv out in\n" >> run_$F.sh ; \
done
# put all the replace-scripts in sequence into a main script
ls ./run_rep_aa* > allrun.sh
# make it executable
chmod +x allrun.sh
# run it
nohup ./allrun.sh &
.. which ran in under 5 mins, a lot less than 20 hours!
Looking back, I could have used more pairs per script, by finding how many lines would make up the limit.
xargs --show-limits </dev/null 2>&1 | grep --color=always "actually use:"
Maximum length of command we could actually use: 2090490
So just under 2MB; how many pairs would that be for my script?
head -c 2090490 80kReplacePairs | wc -l
76923
So it seems I could have used 2 * 40000-line chunks
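The chunking step can be sketched on a toy list (4 pairs, 2 per chunk, instead of 80k and 1000):

```shell
# Hypothetical tiny replace-list: 4 pairs, chunked 2 lines per file.
printf 'aold anew\nbold bnew\ncold cnew\ndold dnew\n' > pairs
# -a 4 gives four-letter suffixes; -l 2 puts 2 lines in each chunk,
# producing rep_aaaa and rep_aaab.
split -a 4 -l 2 pairs rep_
wc -l rep_aaaa rep_aaab
```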
To expand on chthonicdaemon's solution:
generate-regex.sh
<replacement_list perl -p -0 -e '
s/\//\\\//g;
s/([^\n]+)\n([^\n]+)(?:\n([^\n]+)(?:\n([^\n]+))?)?/s\/\1\/\2\/\3;/g
'