Compare a file with two separate lookup files using awk
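Basically, I want to check whether the strings in lookup_1 and lookup_2 occur in my xyz.txt file, then perform the operation and redirect the output to an output file. Also, my code currently replaces every occurrence of a lookup_1 string, even when it is only a substring of a longer word, but I need it replaced only when the whole word matches. Could you help adjust the code to achieve this?
Code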
awk '
FNR==NR { if ($0 in lookups)
              next
          lookups[$0]=$0
          for (i=1;i<=NF;i++) {
              oldstr=$i
              newstr=""
              while (oldstr) {
                  len=length(oldstr)
                  newstr=newstr substr(oldstr,1,1) substr("##",1,len-1)
                  oldstr=substr(oldstr,4)
              }
              ndx=index(lookups[$0],$i)
              lookups[$0]=substr(lookups[$0],1,ndx-1) newstr substr(lookups[$0],ndx+length($i))
          }
          next
        }
        { for (i in lookups) {
              ndx=index($0,i)
              while (ndx > 0) {
                  $0=substr($0,1,ndx-1) lookups[i] substr($0,ndx+length(lookups[i]))
                  ndx=index($0,i)
              }
          }
          print
        }
' lookup_1 xyz.txt > output.txt
lookup_1
ha
achine
skhatw
at
ree
ter
man
dun
lookup_2
United States
CDEXX123X
Institution
xyz.txt
[1] [hamilton] This is a demo file
Demo file is currently being reviewed by user ter
[2] [ter] This is a demo file
Demo file is currently being edited by user skhatw
Internal Machine's Change Request being processed. Approved by user mandeep
Institution code is 'CDEXX123X' where country is United States
Current output
[1] [h#milton] This is a demo file
Demo file is currently being reviewed by user t##
[2] [t##] This is a demo file
Demo file is currently being edited by user skh#tw
Internal Ma##i##'s Ch#nge Request being processed. Approved by user m##deep
Institution code is 'CDEXX123X' where country is United States
Desired output
[1] [hamilton] This is a demo file
Demo file is currently being reviewed by user t##
[2] [t##] This is a demo file
Demo file is currently being edited by user s##a##
Internal Machine's Change Request being processed. Approved by user mandeep
I##t##u##o# code is 'C##X##2##' where country is U##t## S##t##
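The masking rule implied by the desired output (keep the first character of every three-character chunk and replace the rest with #) can be looked at in isolation. The following is a minimal sketch, not the original code: the helper names mask() and mask_words are hypothetical, and the loop tests for an empty string explicitly rather than relying on truthiness.

```shell
# mask_words: hypothetical wrapper that masks every whitespace-separated
# field read from standard input.
mask_words() {
    awk '
    # mask(): hypothetical helper. Keeps the first character of each
    # 3-character chunk and masks the remaining characters with "#".
    function mask(str,    out, len) {
        out = ""
        while (str != "") {
            len = length(str)
            out = out substr(str, 1, 1) substr("##", 1, len - 1)
            str = substr(str, 4)
        }
        return out
    }
    { for (i = 1; i <= NF; i++) $i = mask($i); print }
    '
}

masked=$(printf '%s\n' ter skhatw Institution 'United States' | mask_words)
printf '%s\n' "$masked"
```

With the sample lookup values this prints t##, s##a##, I##t##u##o# and U##t## S##t##, which matches the masked strings in the desired output.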
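We can make a couple of changes to the current code:
- feed the result of cat lookup_1 lookup_2 into awk so that it looks like a single file to awk (see the last line of the new code)
- use the word-boundary operators (\< and \>) to build the regex used to perform the replacements (see the second half of the new code)
New code: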
awk '
# the FNR==NR block of code remains the same
FNR==NR { if ($0 in lookups)
              next
          lookups[$0]=$0
          for (i=1;i<=NF;i++) {
              oldstr=$i
              newstr=""
              while (oldstr) {
                  len=length(oldstr)
                  newstr=newstr substr(oldstr,1,1) substr("##",1,len-1)
                  oldstr=substr(oldstr,4)
              }
              ndx=index(lookups[$0],$i)
              lookups[$0]=substr(lookups[$0],1,ndx-1) newstr substr(lookups[$0],ndx+length($i))
          }
          next
        }
# complete rewrite of the following block to perform replacements based on a regex using word boundaries
        { for (i in lookups) {
              regex= "\\<" i "\\>"            # build regex
              gsub(regex,lookups[i])          # replace strings that match regex
          }
          print
        }
' <(cat lookup_1 lookup_2) xyz.txt            # combine lookup_1/lookup_2 into a single stream so both files are processed under the FNR==NR block of code
This produces:
[1] [hamilton] This is a demo file
Demo file is currently being reviewed by user t##
[2] [t##] This is a demo file
Demo file is currently being edited by user s##a##
Internal Machine's Change Request being processed. Approved by user mandeep
I##t##u##o# code is 'C##X##2##' where country is U##t## S##t##
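The reason for combining the lookup files is that FNR==NR is true only while awk reads its first input, so a second lookup file listed separately would be treated as data. Process substitution, <(...), is a feature of shells such as bash, ksh and zsh; in a plain POSIX sh the same effect can be had by piping into awk and naming standard input as "-". A minimal sketch, assuming throwaway sample files in a temporary directory and a simple substring check in place of the masking logic:

```shell
workdir=$(mktemp -d)
printf '%s\n' ter man         > "$workdir/lookup_1"
printf '%s\n' 'United States' > "$workdir/lookup_2"
printf '%s\n' 'reviewed by user ter' 'country is United States' > "$workdir/xyz.txt"

found=$(cat "$workdir/lookup_1" "$workdir/lookup_2" |
    awk '
    # First input is "-" (stdin), i.e. both lookup files as one stream.
    FNR == NR { lookups[$0]; next }
    # Second input: report which lookup values occur on each line.
    { for (key in lookups)
          if (index($0, key)) print FNR ": " key }
    ' - "$workdir/xyz.txt")

printf '%s\n' "$found"
rm -rf "$workdir"
```

One caveat with either form: if the combined lookup stream is empty, FNR==NR remains true while xyz.txt is read, so the data file would be consumed as lookup values.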
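Notes:
- the word-boundary operators (\< and \>) match the empty string at the start and at the end of a word; in awk a word is defined as a sequence of digits, letters and underscores; these operators are GNU awk extensions, see GNU awk - Regexp Operators for more details
- all of the sample lookup values fall within awk's definition of a word, so this new code works as expected
- lookup values that cannot be considered an awk 'word' (e.g., @vanti Finserv Co., 11:11 - Capital, MS&CO(NY)) are a different matter; in that case this new code may fail to replace these new lookup values
- for such values it is also unclear how 'whole-word matching' should be defined, e.g., whether a non-word character (@) is treated as part of the lookup string or as a word boundary
If you need to replace lookup values that contain (awk) non-word characters, you could try replacing the word-boundary operators with \W, though this in turn causes problems for the lookup values that are (awk) 'words'.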
One possible workaround is to run a pair of regex replacements for each lookup value, e.g.:
awk '
FNR==NR { ... no changes to this block of code ... }
        { for (i in lookups) {
              regex= "\\<" i "\\>"
              gsub(regex,lookups[i])
              regex= "\\W" i "\\W"            # note: the two surrounding non-word characters are part of the match and get replaced too
              gsub(regex,lookups[i])
          }
          print
        }
' <(cat lookup_1 lookup_2) xyz.txt
You will need to determine whether the second regex violates your 'whole-word matching' requirement.
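Another way to sidestep the regex issues is to match the lookup value literally with index() and check the neighbouring characters by hand. This is only a sketch under stated assumptions: replace_whole, replace_literal and is_word_char are hypothetical names, a 'whole' match is assumed to mean that the characters immediately before and after the match are not ASCII letters, digits or underscores, and the lookup value is passed with -v, which interprets backslash escapes.

```shell
# replace_whole LOOKUP REPLACEMENT  (hypothetical helper; reads stdin)
replace_whole() {
    awk -v key="$1" -v repl="$2" '
    function is_word_char(c) { return c ~ /^[A-Za-z0-9_]$/ }

    # Replace literal occurrences of key that do not touch a word character.
    function replace_literal(line, key, repl,    out, pos, prev, before, after) {
        if (key == "") return line
        out = ""; prev = ""
        while ((pos = index(line, key)) > 0) {
            before = (pos > 1) ? substr(line, pos - 1, 1) : prev
            after  = substr(line, pos + length(key), 1)
            if (is_word_char(before) || is_word_char(after)) {
                # part of a longer word: copy one character and keep scanning
                out  = out substr(line, 1, pos)
                prev = substr(line, pos, 1)
                line = substr(line, pos + 1)
            } else {
                out  = out substr(line, 1, pos - 1) repl
                prev = substr(key, length(key), 1)
                line = substr(line, pos + length(key))
            }
        }
        return out line
    }
    { print replace_literal($0, key, repl) }
    '
}

printf '%s\n' 'Internal Machine reviewed by user ter' | replace_whole ter 't##'
printf '%s\n' 'Broker MS&CO(NY) confirmed' | replace_whole 'MS&CO(NY)' 'M##C##N##'
```

Because index() compares literally, characters such as ( ) . in a lookup value and & in a replacement need no escaping. In the first call only the standalone ter is masked and Internal is left alone; in the second, MS&CO(NY) is replaced as a unit. Whether this boundary rule is the right definition for values that start or end with punctuation is a decision for your data.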