
Find and append to text between two strings or words using sed or awk

I am looking for a sed command that can recognize all of the text between two indicators and then replace it with a placeholder.

For instance, the first indicator is a list of words:

(no|noone|haven't)

and the second indicator is a list of punctuation marks:

(.|,|!)

From an input text such as

"Noone understands the plot. There is no storyline. I haven't recommended this movie to my friends! Did you understand it?"

the desired result would be:

"Noone understands_AFFIX the_AFFIX plot_AFFIX. There is no storyline_AFFIX. I haven't recommended_AFFIX this_AFFIX movie_AFFIX to_AFFIX my_AFFIX friends_AFFIX! Did you understand it?"

I know that there is the following sed command:

sed -n '/WORD1/,/WORD2/p' /path/to/file

which recognizes the content between two indicators. I have also found plenty of useful information and resources, but I still cannot find a way to append the affix to each token of text that occurs between the two indicators.
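One point worth checking before building on that command: a sed address range selects whole lines, not text inside a line. The sketch below uses made-up sample lines, with WORD1 and WORD2 as the same placeholder markers, to show both cases.

```shell
# Multi-line input: the range prints every line from the first WORD1 match
# through the next line that matches WORD2.
range_out=$(printf '%s\n' 'before' 'WORD1 start' 'middle' 'WORD2 end' 'after' |
  sed -n '/WORD1/,/WORD2/p')
printf '%s\n' "$range_out"

# Single-line input: the whole line is printed, including the text before
# WORD1 and after WORD2, so the range cannot isolate a span within a line.
one_line=$(printf '%s\n' 'a WORD1 b WORD2 c' | sed -n '/WORD1/,/WORD2/p')
printf '%s\n' "$one_line"
```

Since the sample text in the question is a single line, a range address alone cannot pick out the words between the indicators.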

I have also considered using awk, such as

awk '{sub(/.*indic1 /,"");sub(/ indic2.*/,"");print;}' < infile

yet it still does not let me append the affix.
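To make the limitation concrete, here is a small sketch of what that awk command does on a made-up input line (indic1 and indic2 are the placeholder markers from the command). It extracts the text between the markers but discards everything around it, which is why it cannot be used as-is to tag words in place.

```shell
# The first sub() deletes everything up to and including "indic1 ",
# the second deletes " indic2" and everything after it.
between=$(printf '%s\n' 'x indic1 a b c indic2 y' |
  awk '{ sub(/.*indic1 /, ""); sub(/ indic2.*/, ""); print }')
printf '%s\n' "$between"
```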

Does anyone have a suggestion for doing this with either awk or sed?

Perl to the rescue!

perl -pe 's/\b(?:no(?:one)?|haven'\''t)\b\s*\K([^.,!]+)/
            join " ", map "${_}_AFFIX", split " ", $1/egi
         ' infile > outfile
  • \K keeps what has matched to its left out of the replacement, so only the text after it is substituted. Here it anchors the match on the first indicator without rewriting it. (\K needs Perl 5.10+.)
  • /e evaluates the replacement part as code. Here the code splits $1 on whitespace, map appends _AFFIX to each word, and join puts the words back together into a single string.

Here is a more verbose awk command that does the same:

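s="Noone understands the plot. There is no storyline. I haven't recommended this movie to my friends! Did you understand it?"

# Requires GNU awk: IGNORECASE and the \y word boundary are gawk extensions.
awk -v IGNORECASE=1 -v kw="no|noone|haven't" -v pct='\\.|,|!' '{
   a = 0                                   # a=1 while inside a span to tag
   for (i=2; i<=NF; i++) {
      if ($(i-1) ~ "\\y(" kw ")\\y")       # previous word is a trigger word
         a = 1
      if (a && $i ~ "(" pct ")$") {        # word ends with closing punctuation
         p = substr($i, length($i), 1)     # remember the punctuation mark
         $i = substr($i, 1, length($i)-1)  # and strip it from the word
      }
      if (a)
         $i = $i "_AFFIX" p                # tag the word, restore punctuation
      if (p) {                             # end of span: reset the state
         p = ""
         a = 0
      }
   }
} 1' <<< "$s"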

Output:

Noone understands_AFFIX the_AFFIX plot_AFFIX. There is no storyline_AFFIX. I haven't recommended_AFFIX this_AFFIX movie_AFFIX to_AFFIX my_AFFIX friends_AFFIX! Did you understand it?

A slightly more compact awk:

$ awk 'BEGIN { RS = ORS = " "; s = "_AFFIX" }     # one word per record; gensub needs gawk
       f && /[.,!]$/ { $0 = gensub(/(.)$/, s "\\1", 1); f = 0 }
       f             { $0 = $0 s }
       /^(Noone|no|haven'\''t)$/ { f = 1 }
       1' story

Noone understands_AFFIX the_AFFIX plot_AFFIX. There is no storyline_AFFIX. I haven't recommended_AFFIX this_AFFIX movie_AFFIX to_AFFIX my_AFFIX friends_AFFIX! Did you understand it?
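Both awk answers rely on GNU awk features (IGNORECASE, \y, gensub). Where only a POSIX awk is available, the same idea can be sketched with tolower() and substr(). This is an illustrative variant, not one of the original answers; the function name add_affix and the variable affix are made up for the example, and trigger words are matched as whole, punctuation-free words.

```shell
# Tag every word between a trigger word and the next '.', ',' or '!'
# using only POSIX awk features.
add_affix() {
  awk -v affix="_AFFIX" '
  {
    tagging = 0
    for (i = 1; i <= NF; i++) {
      word = $i
      if (tagging) {
        last = substr(word, length(word), 1)
        if (last == "." || last == "," || last == "!") {
          # Closing punctuation: tag the word, keep the mark, stop tagging.
          $i = substr(word, 1, length(word) - 1) affix last
          tagging = 0
        } else {
          $i = word affix
        }
      } else if (tolower(word) ~ /^(no|noone|haven'\''t)$/) {
        # Trigger word: start tagging from the next word on.
        tagging = 1
      }
    }
    print
  }'
}

s="Noone understands the plot. There is no storyline. I haven't recommended this movie to my friends! Did you understand it?"
result=$(printf '%s\n' "$s" | add_affix)
printf '%s\n' "$result"
```

The tagging state is reset at the start of every input line, so a span that is still open at the end of a line does not carry over to the next one.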
