What's wrong with this awk regex replacement?

I'm running into a peculiar problem when using awk regex matching to replace some text in an xml file.

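The xml files are simple. Each one has a piece of text in its <narrative> node, and the awk program replaces that text with another piece of text picked from a text file, rtxt. But for some reason, the text in rtxt that is meant for 42.xml (the line tagged "42") does not produce a proper substitution.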

toxml.awk writes to standard output. It first prints the xml it has read, and then the final result of the substitution.

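Actually, I have a collection of these xml files and replace their text with text picked from a longer rtxt. It just happens that this particular substitution (for 42.xml) does not work: instead of replacing the text in the <narrative> element, it nests another tag inside the existing one.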
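
The nesting described above is consistent with how awk's sub() and gsub() treat the replacement string: an unescaped & stands for the text the regex matched. The following is only a sketch of that effect, assuming the replacement text for 42 contains an & (for example as part of an XML entity such as &amp;). The sample strings and the function name demo_ampersand are made up for illustration and are not the real rtxt or 42.xml contents.

```sh
#!/bin/sh
# demo_ampersand (hypothetical name): shows how an unescaped & in a gsub()
# replacement string expands to the matched text. Sample strings are made up.
demo_ampersand() {
    awk 'BEGIN {
        text = "<narrative>old text</narrative>"
        rep  = "<narrative>A &amp; B</narrative>"
        # The & inside rep is replaced by the whole matched string,
        # so the old element ends up nested inside the new one.
        gsub(/<narrative>.*<\/narrative>/, rep, text)
        print text
    }'
}

demo_ampersand
# prints: <narrative>A <narrative>old text</narrative>amp; B</narrative>
```

If the real output shows the old <narrative> element sitting where an & should be, this is probably what is happening.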


toxml.awk

BEGIN{
    srcfile = "rtxt"
    FS = "|"

    while (getline <srcfile) {
        xmlfile = $1 ".xml"
        rep = "<narrative>" $2 "</narrative>"

        ## read in the xml file in one go.
        ## (the last tag will be missing.)
        RS = "</topic>"
        FS = "</topic>"

        getline <xmlfile
        #print $0
        close(xmlfile)

        ## replace
        subs = gsub(/<narrative>.*<\/narrative>/, rep, $0)

        ## append the closing tag
        subs = gsub(/[ \n\r\t]+$/, "\n</topic>", $0)
        print $0

        ## restore them before reading rtxt.
        RS = "\n"
        FS = "|"
    }

    close(srcfile)
}
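
awk's gsub() treats an unescaped & in the replacement string as "the matched text", so replacement text that contains & needs escaping before it is passed to gsub(). One minimal change to the replace step is sketched below. The helper name replace_narrative and the environment variable NEWTEXT are hypothetical, and the sketch assumes the replacement text contains no backslashes, which would need extra handling.

```sh
#!/bin/sh
# replace_narrative (hypothetical helper): reads XML lines on stdin and
# replaces a <narrative> element with the text given as the first argument.
# Assumption: the replacement text contains no backslashes.
replace_narrative() {
    NEWTEXT="$1" awk '
    BEGIN {
        text = ENVIRON["NEWTEXT"]
        # Turn every & into \& so that gsub() below treats it literally.
        gsub(/&/, "\\\\&", text)
        rep = "<narrative>" text "</narrative>"
    }
    {
        gsub(/<narrative>.*<\/narrative>/, rep)
        print
    }'
}

printf '  <narrative>old text</narrative>\n' | replace_narrative 'A &amp; B'
# prints:   <narrative>A &amp; B</narrative>
```

The text is passed through ENVIRON rather than -v because awk processes backslash escapes in -v assignments.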

rtxt

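42|Results showing details of training institutes for Java, and IT companies which provide Java solutions are also considered non-relevant. Java is a popular programming language developed at Sun Microsystems. I am interested to know about this programming language, and also to learn it. To be relevant, a result should give information on the history of Java, on different versions of Java, and on different concepts in Java. It's good if I find tutorials for learning Java. Results related only to Sun Microsystems but not Java are considered non-relevant. I like to find articles that discuss this programming language and its various concepts and versions.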


42.xml

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE topic SYSTEM "topic.dtd">
<topic id="2009042" ct_no="227">

  <title>sun java</title>

  <castitle>//article[about(.//language, java) or about(.,sun)]//sec[about(.//language, java)]</castitle>

  <phrasetitle>"sun java"</phrasetitle>

  <description>Find information about Sun Microsystem's Java language</description>

  <narrative>Java is a popular programming language developed at Sun Microsystems. I am interested to know about this programming language, and also to learn programming with it.    To be relevant, a result should give information on history of Java &amp; on different versions of Java, and on different concepts in Java. Its good if I find tutorials for learning Java. Results related only to Sun Microsystems but not Java are considered non-relevant. Results showing details of training institutes for Java, and IT companies which provide Java solutions are also considered non-relevant. I like to find articles that discuss this programming language and various concepts &amp; versions of it.  </narrative>

</topic>

Just a start:

#!/bin/bash

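awk 'BEGIN{FS="|"}
# First file (rtxt): remember the replacement text for each id.
FNR==NR{  nar[$1]=$2; next }
END{
  # The remaining arguments are the xml files to rewrite.
  for(i=2;i<ARGC;i++){
     xmlfile=ARGV[i]
     split(xmlfile,fname,".")
     print "Doing file: "xmlfile
     print "---------------------------------"
     while( (getline line < xmlfile ) > 0)  {
         # Plain assignment, so characters such as & in the text stay literal.
         if ( line ~ /<narrative>/ ){
            line="<narrative>"nar[fname[1]]"</narrative>"
         }
         print line
     }
     close(xmlfile)
  }
}' rtxt 42.xml 71.xml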
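
The line-by-line loop above assumes the <narrative> element sits on a single line. Where it may span several lines, one option is to read the whole document and splice the new element in with match() and substr(). Like the plain assignment above, this involves no replacement-string processing, so & and backslashes in the new text stay literal. A sketch under those assumptions follows; the helper name splice_narrative and the environment variable NEWTEXT are hypothetical.

```sh
#!/bin/sh
# splice_narrative (hypothetical helper): reads a whole XML document on stdin
# and swaps the <narrative> element for the text given as the first argument.
splice_narrative() {
    NEWTEXT="$1" awk '
    { doc = doc $0 "\n" }    # collect the whole document, keeping line breaks
    END {
        rep = "<narrative>" ENVIRON["NEWTEXT"] "</narrative>"
        # match() sets RSTART and RLENGTH; substr() then splices rep in
        # verbatim, with no special meaning for & or backslashes.
        if (match(doc, /<narrative>.*<\/narrative>/))
            doc = substr(doc, 1, RSTART - 1) rep substr(doc, RSTART + RLENGTH)
        printf "%s", doc
    }'
}

printf '<topic>\n  <narrative>old\ntext</narrative>\n</topic>\n' |
    splice_narrative 'history of Java & versions'
```

Only the first match is replaced, which fits documents with a single <narrative> element such as 42.xml.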
