Python replace XML content with Etree

I'd like to parse and compare 2 XML files with the Python Etree parser as follows:

I have 2 XML files with loads of data. One is in English (the source file), the other one is the corresponding French translation (the target file). E.g.:

Source file:

<AB>
  <CD/>
  <EF>

    <GH>
      <id>123</id>
      <IJ>xyz</IJ>
      <KL>DOG</KL>
      <MN>dogs/dog</MN>
      some more tags and info on same level
      <metadata>
        <entry>
           <cl>Translation</cl>
           <cl>English:dog/dogs</cl>
        </entry>
        <entry>
           <string>blabla</string>
           <string>blabla</string>
        </entry>
            some more strings and entries
      </metadata>
    </GH>

  </EF>
  <stuff/>
  <morestuff/>
  <otherstuff/>
  <stuffstuff/>
  <blubb/>
  <bla/>
  <blubbla>8</blubbla>
</AB>

The target file looks exactly the same, but has no text in some places:

<MN>chiens/chien</MN>
some more tags and info on same level
<metadata>
  <entry>
    <cl>Translation</cl>
    <cl></cl>
  </entry>

The French target file has an empty cross-language reference where I'd like to put in the information from the English source file whenever the 2 macros have the same ID. I already wrote some code in which I replaced the string tag name with a unique tag name in order to identify the cross-language reference. Now I want to compare the 2 files and, if 2 macros have the same ID, replace the empty reference in the French file with the info from the English file. I was trying out the minidom parser before but got stuck and would like to try Etree now. I have hardly any knowledge about programming and find this very hard. Here is the code I have so far:

    macros = ElementTree.parse(english)

    for tag in macros.getchildren('macro'):
        id_ = tag.find('id')
        data = tag.find('cl')
        id_dict[id_.text] = data.text

    macros = ElementTree.parse(french)

    for tag in macros.getchildren('macro'):
        id_ = tag.find('id')
        target = tag.find('cl')
        if target.text.strip() == '':
            target.text = id_dict[id_.text]

    print (ElementTree.tostring(macros))

I am more than clueless and reading other posts on this confuses me even more. I'd appreciate it very much if someone could enlighten me :-)

There are probably more details to be clarified. Here is a sample with some debug prints that shows the idea. It assumes that both files have exactly the same structure, and that you want to go only one level below the root:

import xml.etree.ElementTree as etree

english_tree = etree.parse('en.xml')
french_tree = etree.parse('fr.xml')

# Get the root elements, as they support iteration
# through their children (direct descendants)
english_root = english_tree.getroot()
french_root = french_tree.getroot()

# Iterate through the direct descendants of the root
# elements in both trees in parallel.
for en, fr in zip(english_root, french_root):
   assert en.tag == fr.tag # check for the same structure
   if en.tag == 'id':
       assert en.text == fr.text # check for the same id

   elif en.tag == 'string':
       if fr.text is None:
           fr.text = en.text
           print(en.text)     # displaying what was replaced

etree.dump(french_tree)
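
To see what "only one level below the root" means in practice, here is a minimal sketch using a trimmed-down inline stand-in for the question's source file (the XML string is an assumption made for illustration, not the real data). Iterating over an element yields its direct children only, whereas `iter()` walks the element and all of its descendants:

```python
import xml.etree.ElementTree as etree

# Trimmed-down stand-in for the question's source file (illustrative only).
SAMPLE = """<AB>
  <CD/>
  <EF>
    <GH>
      <id>123</id>
      <KL>DOG</KL>
    </GH>
  </EF>
  <bla/>
</AB>"""

root = etree.fromstring(SAMPLE)

# Iterating an element visits its direct children only.
direct_tags = [child.tag for child in root]

# iter() visits the element itself and every descendant, in document order.
all_tags = [elem.tag for elem in root.iter()]

print(direct_tags)
print(all_tags)
```

With a layout like the one in the question, the `id` elements sit three levels below the root, so a loop over the root's direct children never reaches them.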

For more complex file structures, the loop through the direct children of the node can be replaced by iteration through all the elements of the tree. If the structures of the files are exactly the same, the following code will work:

import xml.etree.ElementTree as etree

english_tree = etree.parse('en.xml')
french_tree = etree.parse('fr.xml')

for en, fr in zip(english_tree.iter(), french_tree.iter()):
   assert en.tag == fr.tag        # check if the structure is the same
   if en.tag == 'id':
       assert en.text == fr.text  # identification must be the same
   elif en.tag == 'string':
       if fr.text is None:
           fr.text = en.text
           print(en.text)         # display the inserted text

# Write the result to the output file.
with open('fr2.xml', 'w') as fout:
    fout.write(etree.tostring(french_tree.getroot()).decode('utf-8'))

However, it works only when both files have exactly the same structure. Let's follow the algorithm that would be used if the task were done manually. Firstly, we need to find the French translation that is empty. Then it should be replaced by the English translation from the GH element with the same identification. A subset of XPath expressions is used when searching for the elements:

import xml.etree.ElementTree as etree

def find_translation(tree, id_):
    # Search for the GH element with the given identification, and return
    # its translation if found. Otherwise None is returned implicitly.
    for gh in tree.iter('GH'):
       id_elem = gh.find('./id')
       if id_ == id_elem.text:
           # The related GH element found.
           # Find metadata entry, extract the translation.
           # Warning! This is a simplification that relies on the fixed
           # position of the Translation entry.
           me = gh.find('./metadata/entry')
           assert len(me) == 2     # metadata/entry has two elements
           cl1 = me[0]
           assert cl1.text == 'Translation'
           cl2 = me[1]

           return cl2.text


# Body of the program. --------------------------------------------------

english_tree = etree.parse('en.xml')
french_tree = etree.parse('fr.xml')

for gh in french_tree.iter('GH'): # iterate through the GH elements only 
   # Get the identification of the GH section
   id_elem = gh.find('./id')      
   id_ = id_elem.text

   # Find and check the metadata entry, extract the French translation.
   # Warning! This is a simplification that relies on the fixed position
   # of the Translation entry.
   me = gh.find('./metadata/entry')
   assert len(me) == 2     # metadata/entry has two elements
   cl1 = me[0]
   assert cl1.text == 'Translation'
   cl2 = me[1]
   fr_translation = cl2.text

   # If the French translation is empty, put there the English translation
   # from the related element.
   if cl2.text is None:
       cl2.text = find_translation(english_tree, id_)


with open('fr2.xml', 'w') as fout:
   fout.write(etree.tostring(french_tree.getroot()).decode('utf-8'))
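
With loads of data, searching the English tree again for every French GH element can get slow. One possible variation, sketched here under the same fixed-position assumption for the Translation entry, is to build a dictionary from id to English translation once and then fill the empty French entries from it. The inline XML strings and the helper names `translation_cell` and `fill_missing` are hypothetical, made up for this illustration:

```python
import xml.etree.ElementTree as etree

# Tiny inline stand-ins for en.xml and fr.xml (illustrative only).
EN = """<AB><EF>
  <GH><id>123</id><metadata><entry><cl>Translation</cl><cl>English:dog/dogs</cl></entry></metadata></GH>
  <GH><id>456</id><metadata><entry><cl>Translation</cl><cl>English:cat/cats</cl></entry></metadata></GH>
</EF></AB>"""

FR = """<AB><EF>
  <GH><id>456</id><metadata><entry><cl>Translation</cl><cl></cl></entry></metadata></GH>
  <GH><id>123</id><metadata><entry><cl>Translation</cl><cl></cl></entry></metadata></GH>
</EF></AB>"""


def translation_cell(gh):
    """Return the second <cl> of the first metadata entry, or None.

    Assumes (as above) that the Translation entry comes first.
    """
    entry = gh.find('./metadata/entry')
    if entry is None or len(entry) != 2 or entry[0].text != 'Translation':
        return None
    return entry[1]


def fill_missing(english_root, french_root):
    # Build the id -> English text mapping in a single pass.
    by_id = {}
    for gh in english_root.iter('GH'):
        cell = translation_cell(gh)
        if cell is not None:
            by_id[gh.findtext('./id')] = cell.text

    # Fill only the empty French cells whose id is known.
    for gh in french_root.iter('GH'):
        cell = translation_cell(gh)
        if cell is not None and not (cell.text or '').strip():
            id_ = gh.findtext('./id')
            if id_ in by_id:
                cell.text = by_id[id_]


english_root = etree.fromstring(EN)
french_root = etree.fromstring(FR)
fill_missing(english_root, french_root)

filled = {gh.findtext('./id'): translation_cell(gh).text
          for gh in french_root.iter('GH')}
print(filled)
```

Because the lookup goes through the id, the GH elements do not need to appear in the same order in both files, and a French element with no English counterpart is simply left untouched.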
