
Find and replace words using sed command not working

I have a tab-separated text file: the first column holds the word to find and the second column holds its replacement. Once a word has been found and replaced, the result should not be changed again.

For example:

adam    a +dam
a   b

So for a given text file:

adam played with a ball

I expect:

a +dam played with b ball

However, I get:

b +dbm plbyed with b bbll

I am using the following sed command to find and replace:

sed -e 's/^/s%/' -e 's/\t/%/' -e 's/$/%g/' tab_sep_file.txt | sed -f - original_file.txt >replaced.txt

How can I fix this issue?

The basic problem with your approach is that a later substitution must not change text produced by an earlier one: you don't want the a's in a +dam turned into b's. Your generated sed script runs every s command over the whole line in turn, so the second rule rewrites the output of the first. That makes sed a pretty poor choice. You can fairly easily build a regular expression that matches everything you want to replace, but picking which replacement to use for each match is the hard part.
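To make the failure concrete, here is a minimal Python sketch (an illustration only, not part of the original commands) that applies the two rules one after the other, the way the generated script runs s%adam%a +dam%g and then s%a%b%g on each line. The name `apply_sequentially` is hypothetical, and plain string replacement stands in for sed's regex substitution, which behaves the same for these two words.

```python
def apply_sequentially(line, rules):
    """Apply each (word, replacement) rule in order to the whole line."""
    for word, replacement in rules:
        # Each rule sees the output of the previous one, so text that was
        # just inserted can be rewritten again by a later rule.
        line = line.replace(word, replacement)
    return line


rules = [("adam", "a +dam"), ("a", "b")]
print(apply_sequentially("adam played with a ball", rules))
# prints: b +dbm plbyed with b bbll
```

The second rule has no way of knowing which a's came from the first replacement, which is why the fix has to match and replace in a single pass.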

A way using GNU awk:

gawk -F'\t' '
     FNR == NR { subs[$1] = $2; next } # populate the array of substitutions
     ENDFILE {
             if (FILENAME == ARGV[1]) {
                # Build a regular expression of things to substitute
                subre = "\\<("
                first=0
                for (s in subs)
                    subre = sprintf("%s%s%s", subre, first++ ? "|" : "", s)
                subre = sprintf("%s)\\>", subre)
             }
     }
     {
        # Do the substitution
        nwords = patsplit($0, words, subre, between)
        printf "%s", between[0]
        for (n = 1; n <= nwords; n++)
            printf "%s%s", subs[words[n]], between[n]
        printf "\n"
     }
' tab_sep_file.txt original_file.txt

which outputs

a +dam played with b ball

First it reads the TSV file and builds an array (subs) that maps each word to be replaced to its replacement text. Then, after reading that file, it builds a regular expression that matches all of the words to be found: \<(a|adam)\> in this case. The \< and \> match only at the beginning and end of a word, respectively, so the a in ball won't match.
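The pattern-building step can be sketched in Python as follows. This is an assumption-laden illustration rather than a translation of the gawk code: `build_pattern` is a hypothetical name, Python's `\b` (a boundary on either side of a word) stands in for gawk's `\<` and `\>`, and the words are escaped and sorted longest first as a precaution, because Python tries alternatives left to right.

```python
import re


def build_pattern(words):
    """Build one regex that matches any of the given words as a whole word."""
    # Longest first, so a short word cannot shadow a longer one starting
    # at the same position.
    alternatives = sorted(words, key=len, reverse=True)
    return re.compile(r"\b(?:" + "|".join(re.escape(w) for w in alternatives) + r")\b")


pattern = build_pattern(["a", "adam"])
print(pattern.pattern)                             # \b(?:adam|a)\b
print(pattern.findall("adam played with a ball"))  # ['adam', 'a']
```

Only the standalone adam and a are found; the a's inside played and ball are not at word boundaries, so they are skipped.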

Then, for the second file (the text you want to process), it uses patsplit() to split each line into an array of matched parts (words) and an array of the bits between matches (between). It then iterates over the matches, printing the replacement text for each one followed by the text after it. That way the replacement text is never scanned again, so it can't be re-matched.
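The same split-and-reassemble idea can be sketched in Python. This is not the gawk implementation, only an approximation of what patsplit() provides. The names `patsplit_like` and `replace_once` are hypothetical, and the sketch assumes the table of substitutions is not empty.

```python
import re


def patsplit_like(line, pattern):
    """Split a line into matched words and the text between the matches."""
    words, between, pos = [], [], 0
    for match in pattern.finditer(line):
        between.append(line[pos:match.start()])  # text before this match
        words.append(match.group(0))
        pos = match.end()
    between.append(line[pos:])                   # text after the last match
    return words, between


def replace_once(line, subs):
    """Replace each whole-word key of subs exactly once, in a single pass."""
    keys = sorted(subs, key=len, reverse=True)   # assumes subs is non-empty
    pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, keys)) + r")\b")
    words, between = patsplit_like(line, pattern)
    out = [between[0]]
    for word, gap in zip(words, between[1:]):
        out.append(subs[word])                   # replacement is never rescanned
        out.append(gap)
    return "".join(out)


print(replace_once("adam played with a ball", {"adam": "a +dam", "a": "b"}))
# prints: a +dam played with b ball
```

Because the output is assembled from the replacements and the untouched gaps, the a in a +dam is never examined by the pattern.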


And a Perl version that uses the same approach:

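perl -e '
     # Read the tab-separated word/replacement pairs from the first file
     my %subs;
     open my $words, "<", shift or die $!;
     while (<$words>) {
        chomp;
        my ($word, $rep) = split "\t", $_, 2;
        $subs{$word} = $rep;
     }
     # One pattern matching any of the words, with metacharacters escaped
     my $subre = "\\b(?:" . join("|", map { quotemeta } keys %subs) . ")\\b";
     # Replace every match in a single pass over each remaining input line
     # (<<>> needs Perl 5.22 or later)
     while (<<>>) {
       print s/$subre/$subs{$&}/egr;
     }
' tab_sep_file.txt original_file.txt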

(This one escapes regular expression metacharacters in the words to replace, which makes it more robust.)
