
Optimize sed for multiple replacements

I have a file, users.txt, with words like:

user1
user2
user3

I want to find these words in another file, data.txt, and add a prefix to them. data.txt has nearly 500K lines. For example, user1 should be replaced with New_user1, and so on. I have written a simple shell script:

for user in `cat users.txt`
do
    sed -i 's/'${user}'/New_&/' data.txt
done

For ~1000 words, this program takes minutes to process, which surprised me because sed is very fast when it comes to find and replace. I tried the suggestions from Optimize shell script for multiple sed replacements, but still did not observe much improvement.

Is there any other way to make this process faster?

sed is known to be very fast (probably slower only than hand-written C).

Instead of sed 's/X/Y/g' input.txt, try sed '/X/ s/X/Y/g' input.txt. The latter is known to be faster.

Since you only have one-line-at-a-time semantics, you can run it with parallel (on a multi-core CPU) like this:

cat huge-file.txt | parallel --pipe sed -e '/xxx/ s/xxx/yyy/g'

If you are working with plain ASCII files, you can speed it up by using the "C" locale:

LC_ALL=C sed -i -e '/xxx/ s/xxx/yyy/g' huge-file.txt
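As a rough sanity check that the C locale changes only the speed, not the result, we can generate a throwaway ASCII file and run the same command both ways (the file name, pattern, and replacement here are placeholders for illustration):

```shell
# Build a throwaway ASCII sample file (placeholder data).
seq 200000 | sed 's/.*/&|some test data xxx here/' > sample.txt

# Same command with and without the C locale; on large files the
# C-locale run is typically faster, and the output is identical.
time sed '/xxx/ s/xxx/yyy/g' sample.txt > out-default.txt
time LC_ALL=C sed '/xxx/ s/xxx/yyy/g' sample.txt > out-c.txt

cmp out-default.txt out-c.txt && echo "identical output"
```

Note that LC_ALL=C is only safe when the data really is single-byte; on UTF-8 input it can change what character classes and ranges match.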

You can turn your users.txt into sed commands like this:

$ sed 's|.*|s/&/New_&/|' users.txt 
s/user1/New_user1/
s/user2/New_user2/
s/user3/New_user3/

And then use this to process data.txt, either by writing the output of the previous command to an intermediate file, or with process substitution:

sed -f <(sed 's|.*|s/&/New_&/|' users.txt) data.txt

Your approach goes through all of data.txt once for every single line in users.txt, which is what makes it slow.

If you can't use process substitution, you can use

sed 's|.*|s/&/New_&/|' users.txt | sed -f - data.txt

instead.
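One caveat with the generated script: a plain s/user1/New_user1/ also matches inside longer tokens such as user12. If you are on GNU sed, a possible variation is to anchor each word on \b word boundaries when generating the commands (the file contents below are made-up sample data):

```shell
# Made-up sample input for illustration.
printf 'user1\nuser2\n' > users.txt
printf 'hello user1\nhello user12 and user2\n' > data.txt

# Generate one substitution per word, anchored with \b (a GNU sed
# extension) so that user1 does not also match inside user12.
sed 's|.*|s/\\b&\\b/New_&/g|' users.txt > prefix.sed
cat prefix.sed
# s/\buser1\b/New_user1/g
# s/\buser2\b/New_user2/g

sed -f prefix.sed data.txt
# hello New_user1
# hello user12 and New_user2
```

The doubled backslash is needed because sed itself interprets \\ in the replacement text, leaving a literal \b in the generated script.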

Or, in one go, we can do something like this. Let us say we have a data file with 500k lines.

$> wc -l data.txt
500001 data.txt

$> ls -lrtha data.txt
-rw-rw-r--. 1 gaurav gaurav 16M Oct  5 00:25 data.txt

$> head -2 data.txt ; echo ; tail -2 data.txt
0|This is a test file maybe
1|This is a test file maybe

499999|This is a test file maybe
500000|This is a test file maybe

Let us say that our users.txt has 3-4 keywords, which are to be prefixed with "ab_" in the file data.txt:

$> cat users.txt
file
maybe
test

So we want to read users.txt and, for every word, change that word to a new word. For example, "file" to "ab_file", "maybe" to "ab_maybe", and so on.

We can run a while loop, read the input words to be prefixed one by one, and then run a perl command over the file with the input word stored in a variable. In the example below, each word read is passed to the perl command as $word.

I timed this task, and it completes fairly quickly. I ran it on a CentOS 7 VM hosted on my Windows 10 machine.

$> time cat users.txt | while read word; do perl -pi -e "s/${word}/ab_${word}/g" data.txt; done
real    0m1.973s
user    0m1.846s
sys     0m0.127s
$> head -2 data.txt ; echo ; tail -2 data.txt
0|This is a ab_test ab_file ab_maybe
1|This is a ab_test ab_file ab_maybe

499999|This is a ab_test ab_file ab_maybe
500000|This is a ab_test ab_file ab_maybe

In the above code, we read the words test, file, and maybe, and changed them to ab_test, ab_file, and ab_maybe in the data.txt file. The head and tail output confirms the operation.
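Note that the loop above still rewrites data.txt once per word. A single-pass alternative (a sketch only: it prefixes whole whitespace-separated fields, so words attached to punctuation would be missed) is to load the words into awk once and walk the data file a single time:

```shell
# Made-up sample files mirroring the example above.
printf 'file\nmaybe\ntest\n' > users.txt
printf '0|This is a test file maybe\n1|This is a test file maybe\n' > data.txt

# First pass (NR == FNR) loads users.txt into a lookup table; the
# second pass prefixes any matching field, in one read of data.txt.
awk 'NR == FNR { words[$0]; next }
     { for (i = 1; i <= NF; i++) if ($i in words) $i = "ab_" $i; print }' \
    users.txt data.txt
# 0|This is a ab_test ab_file ab_maybe
# 1|This is a ab_test ab_file ab_maybe
```

This reads each file exactly once, so the runtime stays flat as users.txt grows, instead of scaling with (number of words) x (size of data.txt).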

Cheers, Gaurav
