What's wrong with this awk regex replacement?
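I've run into a peculiar problem while using an awk regex match to replace some text in xml files.
The xml files are simple. Each one has a piece of text inside one of its nodes, and the awk program replaces that text with another piece of text picked from a text file, rtxt. But for some reason, the text from rtxt marked "42", which should replace the text in 42.xml, does not produce a proper substitution.
toxml.awk writes to standard output. It first prints the xml as it was read, then prints the final result of the substitution.
In practice I have a whole collection of these xml files and replace their text with entries picked from a much longer rtxt. It just happens that this particular replacement (for 42.xml) doesn't work. Instead of the text inside the element being replaced, another tag ends up nested inside the existing one.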
toxml.awk
BEGIN {
    srcfile = "rtxt"
    FS = "|"
    while (getline <srcfile) {
        xmlfile = $1 ".xml"
        rep = "<narrative>" $2 "</narrative>"
        ## read in the xml file in one go.
        ## (the last tag will be missing.)
        RS = "</topic>"
        FS = "</topic>"
        getline <xmlfile
        #print $0
        close(xmlfile)
        ## replace
        subs = gsub(/<narrative>.*<\/narrative>/, rep, $0)
        ## append the closing tag
        subs = gsub(/[ \n\r\t]+$/, "\n</topic>", $0)
        print $0
        ## restore them before reading rtxt.
        RS = "\n"
        FS = "|"
    }
    close(srcfile)
}
rtxt
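42|Results showing details of training institutes for Java, and IT companies which provide Java solutions are also considered non-relevant. Java is a popular programming language developed at Sun Microsystems. I am interested to know about this programming language, and also to learn programming with it. To be relevant, a result should give information on history of Java & on different versions of Java, and on different concepts in Java. Its good if I find tutorials for learning Java. Results related only to Sun Microsystems but not Java are considered non-relevant. I like to find articles that discuss this programming language and various concepts & versions of it.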
42.xml
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE topic SYSTEM "topic.dtd">
<topic id="2009042" ct_no="227">
<title>sun java</title>
<castitle>//article[about(.//language, java) or about(.,sun)]//sec[about(.//language, java)]</castitle>
<phrasetitle>"sun java"</phrasetitle>
<description>Find information about Sun Microsystem's Java language</description>
<narrative>Java is a popular programming language developed at Sun Microsystems. I am interested to know about this programming language, and also to learn programming with it. To be relevant, a result should give information on history of Java & on different versions of Java, and on different concepts in Java. Its good if I find tutorials for learning Java. Results related only to Sun Microsystems but not Java are considered non-relevant. Results showing details of training institutes for Java, and IT companies which provide Java solutions are also considered non-relevant. I like to find articles that discuss this programming language and various concepts & versions of it. </narrative>
</topic>
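A likely cause, assuming the rtxt entry for 42 contains a literal & (as the narrative in 42.xml does): in the replacement argument of sub() and gsub(), an unescaped & stands for the text that the regex matched. That would paste the old <narrative> element into the middle of the new one, which fits the nested tag described above. The sketch below reproduces the effect on a made-up one-line input; the variable names and the sample strings are hypothetical and not taken from the real files.

```shell
# Hypothetical one-line input; the replacement string contains a literal "&".
demo_input='<narrative>old text</narrative>'

# Unescaped "&" in the replacement expands to the whole matched text,
# so the old element ends up nested inside the new one.
nested=$(printf '%s\n' "$demo_input" | awk '{
    rep = "<narrative>history & versions</narrative>"
    gsub(/<narrative>.*<\/narrative>/, rep)
    print
}')
printf '%s\n' "$nested"
# prints: <narrative>history <narrative>old text</narrative> versions</narrative>
```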
Just a start:
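#!/bin/bash
awk 'BEGIN { FS = "|" }
# First file (rtxt): remember each narrative text under its topic number.
FNR == NR { nar[$1] = $2; next }
END {
    # The remaining arguments are the xml files to rewrite.
    for (i = 2; i < ARGC; i++) {
        xmlfile = ARGV[i]
        split(xmlfile, fname, ".")
        print "Doing file: " xmlfile
        print "---------------------------------"
        while ((getline line < xmlfile) > 0) {
            # Plain concatenation: nothing in the new text is interpreted.
            if (line ~ /<narrative>/) {
                line = "<narrative>" nar[fname[1]] "</narrative>"
            }
            print line
        }
        close(xmlfile)
    }
}' rtxt 42.xml 71.xml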
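If you would rather keep the regex match from toxml.awk, one option is to locate the element with match() and splice the new text in with substr(), instead of passing it to gsub(). The new text is then never used as a gsub() replacement string, so a & in it (which gsub() would expand to the matched text) stays literal. This is only a sketch under assumptions: the input line and the replacement are invented for the demo, and it handles a single <narrative> element per record.

```shell
# Hypothetical one-line input and replacement; the replacement contains "&".
demo_input='<topic><narrative>old text</narrative></topic>'

spliced=$(printf '%s\n' "$demo_input" | awk '{
    rep = "<narrative>history & versions</narrative>"
    # match() sets RSTART and RLENGTH; the new text is concatenated as
    # plain text, so "&" is never expanded to the matched text.
    if (match($0, /<narrative>.*<\/narrative>/))
        $0 = substr($0, 1, RSTART - 1) rep substr($0, RSTART + RLENGTH)
    print
}')
printf '%s\n' "$spliced"
# prints: <topic><narrative>history & versions</narrative></topic>
```

The same two lines (the match() test and the substr() splice) could stand in for the first gsub() call in toxml.awk, since $0 holds the whole xml record at that point.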