
Find and append to Text Between Two Strings or Words using sed or awk

I am looking for a sed command that recognizes all of the text between two indicators and then tags each word in that text with a placeholder suffix.

For instance, the 1st indicator is a list of words

(no|noone|haven't)

and the 2nd indicator is a list of punctuation marks:

(.|,|!)

From an input text such as

"Noone understands the plot. There is no storyline. I haven't recommended this movie to my friends! Did you understand it?"

The desired result would be:

"Noone understands_AFFIX the_AFFIX plot_AFFIX. There is no storyline_AFFIX. I haven't recommended_AFFIX this_AFFIX movie_AFFIX to_AFFIX my_AFFIX friends_AFFIX! Did you understand it?"

I know that there is the following sed command:

sed -n '/WORD1/,/WORD2/p' /path/to/file

which prints the lines between two indicators. I have also found a lot of great information and resources here. However, I still cannot find a way to append the affix to each token of text that occurs between the two indicators.
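A minimal sketch of why the range form does not get there by itself: the two addresses select whole lines, from the first line matching WORD1 through the next line matching WORD2, rather than a span of text inside one line. The five input lines below are made up for illustration.

```shell
# Hypothetical five-line input; only the lines from the first WORD1 match
# through the next WORD2 match are printed, each of them in full.
out=$(printf '%s\n' 'before' 'WORD1 starts here' 'middle line' 'WORD2 ends here' 'after' |
      sed -n '/WORD1/,/WORD2/p')
printf '%s\n' "$out"
```

Since the sample text is a single line, a range like this would print either the whole line or nothing, so the per-word tagging has to be done some other way.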

I have also considered using awk, such as

awk '{sub(/.*indic1 /,"");sub(/ indic2.*/,"");print;}' < infile

yet still, it does not allow me to append the affix.

Does anyone have a suggestion for how to do so, either with awk or sed?

Perl to the rescue!

perl -pe 's/\b(?:no(?:one)?|haven'\''t)\b\s*\K([^.,!]+)/
            join " ", map "${_}_AFFIX", split " ", $1/egi
         ' infile > outfile
  • \K keeps everything matched to its left out of the replaced text. In this case, the 1st indicator is matched but left untouched. ( \K needs Perl 5.10+.)
  • /e evaluates the replacement part as code. In this case, the code splits $1 on whitespace, map adds _AFFIX to each of the members, and join joins them back into a string.

Here is a more verbose awk command for the same (it needs GNU awk for IGNORECASE and \y):

s="Noone understands the plot. There is no storyline. I haven't recommended this movie to my friends! Did you understand it?"

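printf '%s\n' "$s" |
awk -v IGNORECASE=1 -v kw="no|noone|haven't" -v pct='\\.|,|!' '{
   a = 0                                 # 1 while inside a span to be tagged
   for (i = 2; i <= NF; i++) {
      # Group the alternation so \y applies to every keyword
      if ($(i-1) ~ "\\y(" kw ")\\y")
         a = 1
      # Group here as well so "$" anchors every punctuation mark
      if (a && $i ~ "(" pct ")$") {
         p = substr($i, length($i), 1)
         $i = substr($i, 1, length($i)-1)
      }
      if (a)
         $i = $i "_AFFIX" p
      if (p) {
         p = ""
         a = 0
      }
   }
} 1'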

Output:

Noone understands_AFFIX the_AFFIX plot_AFFIX. There is no storyline_AFFIX. I haven't recommended_AFFIX this_AFFIX movie_AFFIX to_AFFIX my_AFFIX friends_AFFIX! Did you understand it?

A slightly more compact awk (needs GNU awk for gensub):

$ awk 'BEGIN{RS=ORS=" "; s="_AFFIX"}
       f && /[.,!]$/ {f=0; $0=gensub(/(.)$/, s "\\1", 1)}
       f {$0=$0 s}
       /^(Noone|no|haven'\''t)$/ {f=1} 1' story

Noone understands_AFFIX the_AFFIX plot_AFFIX. There is no storyline_AFFIX. I haven't recommended_AFFIX this_AFFIX movie_AFFIX to_AFFIX my_AFFIX friends_AFFIX! Did you understand it?
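Both awk answers above lean on GNU awk features (IGNORECASE, \y, gensub). As a rough sketch for other awks, the same per-field walk can be written with POSIX features only. The variable names (tagging, word, out) are made up for this example, and it assumes an indicator appears as a bare word without punctuation attached; \047 is the octal escape for the apostrophe, so the program can stay inside single quotes.

```shell
s="Noone understands the plot. There is no storyline. I haven't recommended this movie to my friends! Did you understand it?"

out=$(printf '%s\n' "$s" | awk '
{
    tagging = 0
    for (i = 1; i <= NF; i++) {
        word = tolower($i)               # case-insensitive keyword comparison
        if (tagging) {
            if ($i ~ /[.,!]$/) {
                # Closing punctuation: put the suffix before it and stop tagging.
                $i = substr($i, 1, length($i) - 1) "_AFFIX" substr($i, length($i))
                tagging = 0
            } else {
                $i = $i "_AFFIX"
            }
        } else if (word == "no" || word == "noone" || word == "haven\047t") {
            tagging = 1                  # start tagging with the next field
        }
    }
    print
}')
printf '%s\n' "$out"
```

Comparing whole fields with == instead of a regex sidesteps the word-boundary question, at the cost of missing an indicator that carries punctuation such as "no,".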
