
Compare a file with two separate lookup files using awk

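Basically, I want to check whether the strings in lookup_1 and lookup_2 occur in my xyz.txt file, apply the replacement, and redirect the output to an output file. My code currently replaces every occurrence of the lookup_1 strings, even when they match only as a substring, but I need it to replace only on a whole-word match. Could you help tweak the code to achieve this?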

Code

awk '
FNR==NR { if ($0 in lookups)    
             next                            
          lookups[$0]=$0
          for (i=1;i<=NF;i++) {         
              oldstr=$i
              newstr=""
              while (oldstr) {               
                    len=length(oldstr)
                    newstr=newstr substr(oldstr,1,1) substr("##",1,len-1)
                    oldstr=substr(oldstr,4)   
              }
              ndx=index(lookups[$0],$i)   
              lookups[$0]=substr(lookups[$0],1,ndx-1) newstr substr(lookups[$0],ndx+length($i))
          }
          next
        }

        { for (i in lookups) { 
              ndx=index($0,i)                
              while (ndx > 0) {
                    $0=substr($0,1,ndx-1) lookups[i] substr($0,ndx+length(lookups[i]))
                    ndx=index($0,i)                    
              }
          }
          print
        }
' lookup_1 xyz.txt > output.txt

lookup_1

ha
achine
skhatw
at
ree
ter
man
dun

lookup_2

United States
CDEXX123X
Institution

xyz.txt

[1] [hamilton] This is a demo file
Demo file is currently being reviewed by user ter
[2] [ter] This is a demo file 
Demo file is currently being edited by user skhatw
Internal Machine's Change Request being processed. Approved by user mandeep
Institution code is 'CDEXX123X' where country is United States

Current output

[1] [h#milton] This is a demo file
Demo file is currently being reviewed by user t##
[2] [t##] This is a demo file 
Demo file is currently being edited by user skh#tw
Internal Ma##i##'s Ch#nge Request being processed. Approved by user m##deep
Institution code is 'CDEXX123X' where country is United States

Desired output

[1] [hamilton] This is a demo file
Demo file is currently being reviewed by user t##
[2] [t##] This is a demo file 
Demo file is currently being edited by user s##a##
Internal Machine's Change Request being processed. Approved by user mandeep
I##t##u##o# code is 'C##X##2##' where country is U##t## S##t##


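We can make a couple of changes to the current code:

  • Feed the output of cat lookup_1 lookup_2 into awk so that it looks like a single file to awk (see the last line of the new code)
  • Use the word-boundary markers ( \< and \> ) to build the regex that performs the replacement (see the second half of the new code)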
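
To see why the lookup files need to be combined, here is a minimal sketch using throwaway sample files (the temporary directory and its contents are made up for this demonstration). FNR restarts at 1 for every input file while NR keeps counting, so FNR==NR is only true for the first input awk reads. The sketch uses a pipe and "-" (standard input) instead of process substitution, which should also work in shells that lack <( ... ).

```shell
# Hypothetical sample files, created only for this demonstration.
tmp=$(mktemp -d)
printf 'ter\n'         > "$tmp/lookup_1"
printf 'Institution\n' > "$tmp/lookup_2"
printf 'some data\n'   > "$tmp/xyz.txt"

# Lookup files passed separately: FNR resets for lookup_2, so FNR==NR is
# false there and its lines fall through to the "data" block.
separate=$(awk 'FNR==NR { print "lookup: " $0; next } { print "data: " $0 }' \
    "$tmp/lookup_1" "$tmp/lookup_2" "$tmp/xyz.txt")

# Lookup files concatenated first: awk sees a single first input ("-" means
# standard input), so every lookup line is handled by the FNR==NR block.
combined=$(cat "$tmp/lookup_1" "$tmp/lookup_2" |
    awk 'FNR==NR { print "lookup: " $0; next } { print "data: " $0 }' - "$tmp/xyz.txt")

printf 'separate:\n%s\n\ncombined:\n%s\n' "$separate" "$combined"
rm -rf "$tmp"
```

In the "separate" run the line from lookup_2 is labelled as data, while in the "combined" run both lookup lines are labelled as lookups.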

New code:

awk '
        # the FNR==NR block of code remains the same

FNR==NR { if ($0 in lookups)
             next
          lookups[$0]=$0
          for (i=1;i<=NF;i++) {
              oldstr=$i
              newstr=""
              while (oldstr) {
                    len=length(oldstr)
                    newstr=newstr substr(oldstr,1,1) substr("##",1,len-1)
                    oldstr=substr(oldstr,4)
              }
              ndx=index(lookups[$0],$i)
              lookups[$0]=substr(lookups[$0],1,ndx-1) newstr substr(lookups[$0],ndx+length($i))
          }
          next
        }

        # complete rewrite of the following block to perform replacements based on a regex using word boundaries

        { for (i in lookups) {
              regex= "\\<" i "\\>"            # build regex
              gsub(regex,lookups[i])          # replace strings that match regex
          }
          print
        }
' <(cat lookup_1 lookup_2) xyz.txt            # combine lookup_1/lookup_2 into a single stream so both files are processed under the FNR==NR block of code

This generates:

[1] [hamilton] This is a demo file
Demo file is currently being reviewed by user t##
[2] [t##] This is a demo file
Demo file is currently being edited by user s##a##
Internal Machine's Change Request being processed. Approved by user mandeep
I##t##u##o# code is 'C##X##2##' where country is U##t## S##t##
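
The masked values in this output come from the unchanged FNR==NR block. As a rough illustration of that logic in isolation, the sketch below wraps it in a shell function (mask is a made-up name, not part of the original code): it keeps the first character of every three-character group and replaces the rest of the group with #.

```shell
# mask WORD: print WORD with each 3-character group reduced to its first
# character followed by "#" for the remaining characters of that group.
# (Hypothetical helper; it mirrors the inner while loop of the FNR==NR block.)
mask() {
    awk -v word="$1" 'BEGIN {
        out = ""
        while (word != "") {
            # first character, then up to two "#" (fewer if the group is short)
            out = out substr(word, 1, 1) substr("##", 1, length(word) - 1)
            word = substr(word, 4)      # move on to the next 3-character group
        }
        print out
    }'
}

mask ter            # t##
mask skhatw         # s##a##
mask Institution    # I##t##u##o#
```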

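Notes:

  • The boundary markers ( \< and \> ) are GNU awk extensions that match the empty string at the start and end of a word, i.e. the point where a word character meets a non-word character or the edge of the line; in awk, a word is defined as a sequence of digits, letters and underscores; see GNU awk - regex operators for more details
  • All of the sample lookup values start and end with awk word characters, so this new code works as expected
  • Your previous question included lookup values that cannot be considered an awk 'word' (e.g., @vanti Finserv Co.11:11 - CapitalMS&CO(NY) ), in which case this new code may fail to replace these new lookup values
  • For lookup values that contain non-word characters it is not clear how you would define a 'whole word match', since you would also need to decide when a non-word character (e.g., @ ) is to be treated as part of the lookup string vs. being treated as a word boundary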

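If you need to replace lookup values that contain ( awk ) non-word characters, you could try replacing the word-boundary markers with \W, though this then causes problems for lookup values that are ( awk ) 'words'.

One possible workaround is to run a dual set of regex matches for each lookup value, e.g.: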

awk '
FNR==NR { ... no changes to this block of code ... }

        { for (i in lookups) {
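              regex= "\\<" i "\\>"            # pass 1: lookup value delimited by word boundaries
              gsub(regex,lookups[i])
              regex= "\\W" i "\\W"            # pass 2: lookup value surrounded by non-word characters
              gsub(regex,lookups[i])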
          }
          print
        }
' <(cat lookup_1 lookup_2) xyz.txt

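You will need to determine whether the second regex breaks your 'whole word match' requirement. Note that \W matches an actual character, so the second gsub() also replaces the non-word character on each side of the lookup value, and it cannot match a lookup value at the very start or end of a line.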
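
As an alternative to the regex passes, here is a minimal sketch that swaps in a plain index() scan and checks the characters on either side of each hit. It rests on one assumption that may not match your requirement: a hit counts as a 'whole word' when the character immediately before it and the character immediately after it are not word characters (letters, digits, underscore), with the start and end of the line also counting as boundaries. The function name replace_whole and the sample strings are made up for illustration. Because no regex is built from the lookup value, characters such as ( or . in a lookup value are treated literally.

```shell
# replace_whole LOOKUP REPLACEMENT < input
# Hypothetical helper: replaces LOOKUP only where it is not directly adjacent
# to a word character. Note that awk -v interprets backslash escapes, so lookup
# values containing backslashes would need different handling.
replace_whole() {
    awk -v find="$1" -v repl="$2" '
    function is_word_char(c) { return c ~ /^[A-Za-z0-9_]$/ }
    find == "" { print; next }              # nothing to look for
    {
        out = ""; rest = $0; prev = ""      # prev: character just before rest
        while ((pos = index(rest, find)) > 0) {
            before = (pos > 1) ? substr(rest, pos - 1, 1) : prev
            after  = substr(rest, pos + length(find), 1)
            if (!is_word_char(before) && !is_word_char(after)) {
                # whole-word hit: copy the text before it, then the replacement
                out  = out substr(rest, 1, pos - 1) repl
                prev = substr(find, length(find), 1)
                rest = substr(rest, pos + length(find))
            } else {
                # partial hit: copy through its first character and keep scanning
                out  = out substr(rest, 1, pos)
                prev = substr(rest, pos, 1)
                rest = substr(rest, pos + 1)
            }
        }
        print out rest
    }'
}

printf '%s\n' "Approved by user mandeep, not man (man)" | replace_whole man 'm##'
# prints: Approved by user mandeep, not m## (m##)
```

To use this idea with the lookup files, the same scan would replace the gsub() calls inside the for (i in lookups) loop; whether the boundary rule above is the right one for values such as those starting with @ is still your call.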

