
Search/replace content of XML

I've been successful using xml.etree.ElementTree to parse an XML file, search for content, then write it to a different XML file. However, I only worked with text inside a single tag.

import os, glob, xml.etree.ElementTree as ET
path = r"G:\63D RRC GIS Data\metadata\general\2010_contract"
for fn in os.listdir(path):
    filepaths = glob.glob(path + os.sep + fn + os.sep + "*overall.xml")
    for filepath in filepaths:
        (pa, filename) = os.path.split(filepath)
        # grab element text from the old, archived metadata file; this text then gets put into the current, working .xml
        root = ET.parse(pa + os.sep + "archive" + os.sep + "base_metadata_overall.xml").getroot()

        correct_abstract = None
        for item in root.iter():
            if item.tag == "abstract":
                correct_abstract = item.text

        root2 = ET.parse(pa + os.sep + "base_metadata_overall.xml").getroot()

        # each <descript> element holds the <abstract> that needs replacing
        for item in root2.iter("descript"):
            old_abstract = item.find("abstract")
            if old_abstract is not None:
                old_abstract_text = old_abstract.text
                item.remove(old_abstract)
                new_symbol_abstract = ET.SubElement(item, "abstract")
                new_symbol_abstract.text = correct_abstract
        tree = ET.ElementTree(root2)
        tree.write(pa + os.sep + "base_metadata_overall.xml")
        print("created --- " + filename + " metadata")

But now, I need to:

1) Search an XML file and grab everything between the "attr" tags. Below is an example:

<attr><attrlabl Sync="TRUE">OBJECTID</attrlabl><attalias Sync="TRUE">ObjectIdentifier</attalias><attrtype Sync="TRUE">OID</attrtype><attwidth Sync="TRUE">4</attwidth><atprecis Sync="TRUE">0</atprecis><attscale Sync="TRUE">0</attscale><attrdef Sync="TRUE">Internal feature number.</attrdef></attr>

2) Open a different XML file, search for all content between the same "attr" tags, and replace it with the above.

Basically, the same thing I was doing before, but ignoring the subelements, attributes, etc. between the "attr" tags and treating it all like text.

Thanks!!

Please bear with me, posting on this forum is a little different than I'm used to!

Here's what I have so far:

import os, glob, re, copy
from lxml import etree

path = r"C:\temp\python\xml"
for fn in os.listdir(path):
    filepaths = glob.glob(path + os.sep + fn + os.sep + "*overall.xml")
    for filepath in filepaths:
        (pa, filename) = os.path.split(filepath)

        # source: grab each <attr>...</attr> block as raw text
        xml = open(pa + os.sep + "attributes.xml")
        xmltext = xml.read()
        correct_attrs = re.findall("<attr>(.*?)</attr>", xmltext, re.DOTALL)

        # target: parse the whole document so every <attr> has a parent
        xml2 = open(pa + os.sep + "base_metadata_overall.xml")
        xmltext2 = xml2.read()
        old = etree.fromstring(xmltext2)

        for item in correct_attrs:
            correct_attribute = "<attr>" + item + "</attr>"
            new = etree.fromstring(correct_attribute)
            # swaps every target <attr> for the current source <attr>
            for attr in old.xpath('//attr'):
                attr.getparent().replace(attr, copy.deepcopy(new))
            print(etree.tostring(old))

I got this working (see below), and even figured out how to export to a new .xml file. However, if the number of attr's differs between source and destination, I get the following error. Any suggestions?

node = replacements.pop()

IndexError: pop from empty list

import os, glob, copy
from lxml import etree
path = r"C:\temp\python\xml"
for fn in os.listdir(path):
    filepaths = glob.glob(path + os.sep + fn + os.sep + "*overall.xml")
    for filepath in filepaths:
        (pa, filename) = os.path.split(filepath)
        xmlatributes = open(pa + os.sep + "attributes.xml")
        xmlatributes_txt = xmlatributes.read()
        xmltarget = open(pa + os.sep + "base_metadata_overall.xml")
        xmltarget_txt = xmltarget.read()
        source = etree.fromstring(xmlatributes_txt)
        dest = etree.fromstring(xmltarget_txt)

        # reversed so that pop() hands the nodes back in document order
        replacements = source.xpath('//attr')
        replacements.reverse()

        for attr in dest.xpath('//attr'):
            # raises IndexError once dest has more <attr> nodes than source
            node = replacements.pop()
            attr.getparent().replace(attr, copy.deepcopy(node))
        #print etree.tostring(dest)
        tree = etree.ElementTree(dest)
        tree.write(pa + os.sep + "edited_metadata.xml")
        print(fn + "--- successfully edited")

Update 5/16/2011: I restructured a few things to fix the "IndexError: pop from empty list" error mentioned above. I realized that replacing the "attr" tags will not always be a 1-to-1 replacement. For example, sometimes the source .xml has 20 attr's and the destination .xml has 25 attr's. In this case, the 1-to-1 replacement would choke.

Anyway, the code below removes all the attr's, then replaces them with the source attr's. It also checks for another tag, "subtype"; if any exist, it adds them after the attr's, but still inside the "detailed" tags.

Thanks again to everyone who helped.

import os, glob, copy
from lxml import etree
path = r"G:\63D RRC GIS Data\metadata\general\2010_contract"
#path = r"C:\temp\python\xml"
for fn in os.listdir(path):
    correct_title = fn.replace('_', ' ') + " various facilities"
    correct_fc_name = fn.replace('_', ' ')
    filepaths = glob.glob(path + os.sep + fn + os.sep + "*overall.xml")
    for filepath in filepaths:
        print("-----" + fn + "-----")
        (pa, filename) = os.path.split(filepath)
        xmlatributes = open(pa + os.sep + "attributes.xml")
        xmlatributes_txt = xmlatributes.read()
        xmltarget = open(pa + os.sep + "base_metadata_overall.xml")
        xmltarget_txt = xmltarget.read()
        source = etree.fromstring(xmlatributes_txt)
        dest = etree.fromstring(xmltarget_txt)
        replacements = source.xpath('//attr')
        replacesubtypes = source.xpath('//subtype')
        subtype_true_f = len(replacesubtypes)

        # remove every existing <attr> from the destination
        attrtag = dest.xpath('//attr')
        for n in attrtag:
            n.getparent().remove(n)
        print("%d attr's removed" % len(attrtag))

        detailedtag = dest.xpath('//detailed')
        for n2 in detailedtag:
            # start inserting after the first child of <detailed>
            pos = 1
            for realatrs in replacements:
                # deepcopy so the node is not moved out of the source tree
                n2.insert(pos, copy.deepcopy(realatrs))
                pos += 1
            print("attr's replaced")
            if subtype_true_f >= 1:
                #print subtype_true_f
                for realsubtypes in replacesubtypes:
                    n2.insert(pos, copy.deepcopy(realsubtypes))
                    pos += 1
                print("subtype's replaced")

        tree = etree.ElementTree(dest)
        tree.write(pa + os.sep + "base_metadata_overall_v2.xml")
        print(fn + "--- successfully edited")
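For anyone who wants to try the remove-then-insert idea without the real metadata files, here is a minimal sketch on two made-up inline documents. It uses the standard library's xml.etree.ElementTree rather than lxml, and it appends the new nodes at the end of "detailed" instead of inserting them at a fixed index; the sample XML and the replace_attrs name are assumptions for illustration only.

```python
import copy
import xml.etree.ElementTree as ET

# Hypothetical sample documents; the real metadata files are much larger.
SOURCE_XML = ("<metadata><detailed><enttyp/>"
              "<attr><attrlabl>OBJECTID</attrlabl></attr>"
              "<attr><attrlabl>SHAPE</attrlabl></attr>"
              "<subtype><stname>A</stname></subtype>"
              "</detailed></metadata>")
TARGET_XML = ("<metadata><detailed><enttyp/>"
              "<attr><attrlabl>OLD1</attrlabl></attr>"
              "<attr><attrlabl>OLD2</attrlabl></attr>"
              "<attr><attrlabl>OLD3</attrlabl></attr>"
              "</detailed></metadata>")


def replace_attrs(source_root, dest_root):
    """Drop every <attr> under <detailed> in dest, then add the source ones."""
    new_nodes = source_root.findall(".//attr") + source_root.findall(".//subtype")
    for detailed in dest_root.iter("detailed"):
        # iterate over a copy of the child list because we remove while looping
        for child in list(detailed):
            if child.tag == "attr":
                detailed.remove(child)
        # append copies in source order: attrs first, then subtypes
        for node in new_nodes:
            detailed.append(copy.deepcopy(node))
    return dest_root


source = ET.fromstring(SOURCE_XML)
dest = replace_attrs(source, ET.fromstring(TARGET_XML))
print(ET.tostring(dest, encoding="unicode"))
```

Because the old nodes are removed wholesale before the new ones are added, the source and destination no longer need the same number of attr's.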

Here is an example of using lxml to do this. I'm not exactly sure how you want the <attr/> nodes replaced, but this example should provide a pattern you can reuse.

Update: I changed it to replace each <attr> in tree2 with the corresponding node from tree1, in document order:

import copy
import lxml.etree

xml1 = '''<root><attr><chaos foo="0"/></attr><attr><arena foo="1"/></attr></root>'''
xml2 = '''<tree><attr><one/></attr><attr><two/></attr></tree>'''
tree1 = lxml.etree.fromstring(xml1)
tree2 = lxml.etree.fromstring(xml2)

# select <attr/> nodes from tree1, will be used to replace corresponding
# nodes in tree2
replacements = tree1.xpath('//attr')
replacements.reverse()

for attr in tree2.xpath('//attr'):
    # replace the attr node in tree2 with 'replacement' from tree1
    node = replacements.pop()
    attr.getparent().replace(attr, copy.deepcopy(node))

print(lxml.etree.tostring(tree2))

Result:

<tree>
  <attr><chaos foo="0"/></attr>
  <attr><arena foo="1"/></attr>
</tree>
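If the two documents might not hold the same number of <attr> nodes, one possible variation is to pair the nodes with zip(), which simply stops at the shorter list instead of popping from an empty one. The sketch below is written against the standard library's xml.etree.ElementTree, not lxml, so it builds its own child-to-parent lookup in place of getparent(); the third <attr> in xml2 is an extra added here to show the unequal case.

```python
import copy
import xml.etree.ElementTree as ET

xml1 = '<root><attr><chaos foo="0"/></attr><attr><arena foo="1"/></attr></root>'
# one more <attr> than xml1, to exercise the unequal-count case
xml2 = '<tree><attr><one/></attr><attr><two/></attr><attr><three/></attr></tree>'
tree1 = ET.fromstring(xml1)
tree2 = ET.fromstring(xml2)

# ElementTree has no getparent(), so build a child -> parent lookup first
parent_of = {child: parent for parent in tree2.iter() for child in parent}

# zip() stops at the shorter list, so surplus <attr> nodes in tree2 stay as-is
for old, new in zip(tree2.findall(".//attr"), tree1.findall(".//attr")):
    parent = parent_of[old]
    index = list(parent).index(old)
    parent[index] = copy.deepcopy(new)

print(ET.tostring(tree2, encoding="unicode"))
```

Whether leftover nodes should be kept, as here, or removed depends on what the destination file is supposed to contain.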

This sounds like something that XSLT transformations were made for. Have you tried that?

I'd also recommend a library like Beautiful Soup for parsing and manipulating XML.
