
Optimize sed for multiple replacements

I have a file, users.txt, with words like:

user1
user2
user3

I want to find these words in another file, data.txt, and add a prefix to them. data.txt has nearly 500K lines. For example, user1 should be replaced with New_user1, and so on. I have written a simple shell script:

for user in `cat users.txt`
do
    sed -i 's/'${user}'/New_&/' data.txt
done

For ~1000 words, this program takes minutes to process, which surprised me because sed is very fast when it comes to find and replace. I tried the suggestions from Optimize shell script for multiple sed replacements, but still did not observe much improvement.

Is there any other way to make this process faster?

sed is known to be very fast (probably slower only than hand-written C).

Instead of sed 's/X/Y/g' input.txt, try sed '/X/ s/X/Y/g' input.txt. The latter is known to be faster.

Since you only have one-line-at-a-time semantics, you can run it with parallel (on a multi-core CPU) like this:

cat huge-file.txt | parallel --pipe sed -e '/xxx/ s/xxx/yyy/g'

If you are working with plain ASCII files, you can speed it up by using the "C" locale:

LC_ALL=C sed -i -e '/xxx/ s/xxx/yyy/g' huge-file.txt
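As a rough sanity check that the C locale changes only the speed, not the result, we can generate a throwaway ASCII file and run the same command both ways (the file name, pattern, and replacement here are placeholders for illustration):

```shell
# Build a throwaway ASCII sample file (placeholder data).
seq 200000 | sed 's/.*/&|some test data xxx here/' > sample.txt

# Same command with and without the C locale; on large files the
# C-locale run is typically faster, and the output is identical.
time sed '/xxx/ s/xxx/yyy/g' sample.txt > out-default.txt
time LC_ALL=C sed '/xxx/ s/xxx/yyy/g' sample.txt > out-c.txt

cmp out-default.txt out-c.txt && echo "identical output"
```

Note that LC_ALL=C is only safe when the data really is single-byte; on UTF-8 input it can change what character classes and ranges match.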

You can turn your users.txt into sed commands like this:

$ sed 's|.*|s/&/New_&/|' users.txt 
s/user1/New_user1/
s/user2/New_user2/
s/user3/New_user3/

And then use this to process data.txt, either by writing the output of the previous command to an intermediate file, or with process substitution:

sed -f <(sed 's|.*|s/&/New_&/|' users.txt) data.txt

Your approach goes through all of data.txt once for every single line in users.txt, which is what makes it slow.

If you can't use process substitution, you can use

sed 's|.*|s/&/New_&/|' users.txt | sed -f - data.txt

instead.
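One caveat with the generated script: a plain s/user1/New_user1/ also matches inside longer tokens such as user12. If you are on GNU sed, a possible variation is to anchor each word on \b word boundaries when generating the commands (the file contents below are made-up sample data):

```shell
# Made-up sample input for illustration.
printf 'user1\nuser2\n' > users.txt
printf 'hello user1\nhello user12 and user2\n' > data.txt

# Generate one substitution per word, anchored with \b (a GNU sed
# extension) so that user1 does not also match inside user12.
sed 's|.*|s/\\b&\\b/New_&/g|' users.txt > prefix.sed
cat prefix.sed
# s/\buser1\b/New_user1/g
# s/\buser2\b/New_user2/g

sed -f prefix.sed data.txt
# hello New_user1
# hello user12 and New_user2
```

The doubled backslash is needed because sed itself interprets \\ in the replacement text, leaving a literal \b in the generated script.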

Or, in one go, we can do something like this. Let us say we have a data file with 500k lines.

$> wc -l data.txt
500001 data.txt

$> ls -lrtha data.txt
-rw-rw-r--. 1 gaurav gaurav 16M Oct  5 00:25 data.txt

$> head -2 data.txt ; echo ; tail -2 data.txt
0|This is a test file maybe
1|This is a test file maybe

499999|This is a test file maybe
500000|This is a test file maybe

Let us say that our users.txt has 3-4 keywords, which are to be prefixed with "ab_" in the file data.txt:

$> cat users.txt
file
maybe
test

So we want to read users.txt and, for every word, change that word to a new word. For example, "file" to "ab_file", "maybe" to "ab_maybe", and so on.

We can run a while loop, read the input words to be prefixed one by one, and then run a perl command over the file with the input word stored in a variable. In the example below, each word read is passed to the perl command as $word.

I timed this task, and it completes fairly quickly. I ran it on a CentOS 7 VM hosted on my Windows 10 machine.

$> time cat users.txt | while read word; do perl -pi -e "s/${word}/ab_${word}/g" data.txt; done
real    0m1.973s
user    0m1.846s
sys     0m0.127s
$> head -2 data.txt ; echo ; tail -2 data.txt
0|This is a ab_test ab_file ab_maybe
1|This is a ab_test ab_file ab_maybe

499999|This is a ab_test ab_file ab_maybe
500000|This is a ab_test ab_file ab_maybe

In the above code, we read the words test, file, and maybe, and changed them to ab_test, ab_file, and ab_maybe in the data.txt file. The head and tail output confirms the operation.
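Note that the loop above still rewrites data.txt once per word. A single-pass alternative (a sketch only: it prefixes whole whitespace-separated fields, so words attached to punctuation would be missed) is to load the words into awk once and walk the data file a single time:

```shell
# Made-up sample files mirroring the example above.
printf 'file\nmaybe\ntest\n' > users.txt
printf '0|This is a test file maybe\n1|This is a test file maybe\n' > data.txt

# First pass (NR == FNR) loads users.txt into a lookup table; the
# second pass prefixes any matching field, in one read of data.txt.
awk 'NR == FNR { words[$0]; next }
     { for (i = 1; i <= NF; i++) if ($i in words) $i = "ab_" $i; print }' \
    users.txt data.txt
# 0|This is a ab_test ab_file ab_maybe
# 1|This is a ab_test ab_file ab_maybe
```

This reads each file exactly once, so the runtime stays flat as users.txt grows, instead of scaling with (number of words) x (size of data.txt).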

Cheers, Gaurav
