Merging XML files using Python's ElementTree and maintaining CDATA tags

I pretty much reused the same code from here, "Merging xml files using python's ElementTree", and I got it to work. The XML files I am trying to merge look like this:

A.XML

<root>
  <categories>
    <category name="Biology" />
  </categories>
  <app>
    <mainHeader><![CDATA[AP Biology]]></mainHeader>
    <questions>
      <question type="0" number="1" title="Biology #1">
        <images />
        <description><![CDATA[<b>Which of the following is 
        the site of protein synthesis?</b>]]></description>
        <category><![CDATA[Biology]]></category>
        <choices>
          <choice name="A"><![CDATA[Cell wall]]></choice>
          <choice name="B" correct_answer="true"><![CDATA[Ribosomes]]></choice>
          <choice name="C"><![CDATA[Vacuoles]]></choice>
          <choice name="D"><![CDATA[DNA polymerase]]></choice>
          <choice name="E"><![CDATA[RNA polymerase]]></choice>
        </choices>
        <explanation><![CDATA[<b>Answer:</b> B, Ribosomes.  Translation, the 
        process that converts mRNA code into protein, takes place in ribosomes.
        <br /><br /><b>Key Takeaway: </b>Ribosomes are complexes of RNA and 
        protein that are located in cell nuclei.  Ribosomes catalyze both the 
        conversion of the mRNA code into amino acids as well as the assembly of 
        the individual amino acids into a peptide change that becomes a protein.
        ]]></explanation>
      </question>
    </questions>
  </app>
</root>

B.XML

<root>
  <categories>
    <category name="Biology" />
  </categories>
  <app>
    <mainHeader><![CDATA[SAT Biology]]></mainHeader>
    <questions>
      <question type="0" number="1" title="Biology #1">
        <images>
        </images>
        <category><![CDATA[Biology]]></category>
        <description><![CDATA[<b>The site of cellular respiration 
        is:</b>]]></description>
        <choices>
          <choice name="A"><![CDATA[DNA polymerase]]></choice>
          <choice name="B"><![CDATA[Ribosomes]]></choice>
          <choice name="C" correct_answer="true"><![CDATA[Mitochondria]]></choice>
          <choice name="D"><![CDATA[RNA polymerase]]></choice>
          <choice name="E"><![CDATA[Vacuoles]]></choice>
        </choices>
        <explanation><![CDATA[<b>Answer:</b> C, Mitochondria.  
        The mitochondrion (plural mitochondria) is known as the “powerhouse” 
        of the cell for its role in energy production.<br /><br />
        <b>Key Takeaway: </b>The mitochondrion is a membrane-bound organelle 
        found in most eukaryotic cells.  The dominant role of the mitochondrion 
        is the production of ATP through cellular respiration, which is 
        dependent on the presence of oxygen.  All forms of cellular 
        respiration, glycolysis, Krebs’ cycle, and oxidative phosphorylation, 
        take place within the mitochondria.]]></explanation>
      </question>
    </questions>
  </app>
</root>

Here is the code I used to merge them:

import os, os.path, sys
import glob
from xml.etree import ElementTree

def run(files):
    xml_files = glob.glob(files +"/*.xml")
    xml_element_tree = None
    for xml_file in xml_files:
        data = ElementTree.parse(xml_file).getroot()
        # print ElementTree.tostring(data)
        for question in data.iter('questions'):
            if xml_element_tree is None:
                xml_element_tree = data 
                insertion_point = xml_element_tree.find('app').findall("./questions")[0]
            else:
                insertion_point.extend(question) 
    if xml_element_tree is not None:
        print ElementTree.tostring(xml_element_tree)

It works, except that the output does not maintain the CDATA tags. Specifically, this is the output I get:

<root>
  <categories>
    <category name="Biology" />
  </categories>
  <app>
    <mainHeader>AP Biology</mainHeader>
    <questions>
      <question number="1" title="Biology #1" type="0">
        <images />
        <category>Biology</category>
        <description>&lt;b&gt;Which of the following is the site 
        of protein synthesis?&lt;/b&gt;</description>
        <choices>
          <choice name="A">Cell wall</choice>
          <choice correct_answer="true" name="B">Ribosomes</choice>
          <choice name="C">Vacuoles</choice>
          <choice name="D">DNA polymerase</choice>
          <choice name="E">RNA polymerase</choice>
        </choices>
        <explanation>&lt;b&gt;Answer:&lt;/b&gt; B, Ribosomes.  
        Translation, the process that converts mRNA code into protein, 
        takes place in ribosomes.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;
        Key Takeaway: &lt;/b&gt;Ribosomes are complexes of RNA and protein 
        that are located in cell nuclei.  Ribosomes catalyze both the 
        conversion of the mRNA code into amino acids as well as the assembly 
        of the individual amino acids into a peptide change that becomes 
        a protein.</explanation>
      </question>
      <question number="1" title="Biology #1" type="0">
        <images>
        </images>
        <category>Biology</category>
        <description>&lt;b&gt;The site of cellular respiration is:&lt;/b&gt;
        </description>
        <choices>
          <choice name="A">DNA polymerase</choice>
          <choice name="B">Ribosomes</choice>
          <choice correct_answer="true" name="C">Mitochondria</choice>
          <choice name="D">RNA polymerase</choice>
          <choice name="E">Vacuoles</choice>
        </choices>
        <explanation>&lt;b&gt;Answer:&lt;/b&gt; C, Mitochondria.  The 
        mitochondrion (plural mitochondria) is known as the &#8220;
        powerhouse&#8221; of the cell for its role in energy production.
        &lt;br /&gt;&lt;br /&gt;&lt;b&gt;Key Takeaway: &lt;/b&gt;The 
        mitochondrion is a membrane-bound organelle found in most 
        eukaryotic cells.  The dominant role of the mitochondrion is the 
        production of ATP through cellular respiration, which is dependent 
        on the presence of oxygen.  All forms of cellular respiration, 
        glycolysis, Krebs&#8217; cycle, and oxidative phosphorylation, 
        take place within the mitochondria.</explanation>
      </question>
    </questions>
  </app>
</root>

The output I want, however, is:

<root>
  <categories>
     <category name="Biology" />
   </categories>
  <app>
    <mainHeader><![CDATA[AP Biology]]></mainHeader>
    <questions>
       <question type="0" number="1" title="Biology #1">
        <images />
        <category><![CDATA[Biology]]></category>
        <description><![CDATA[<b>Which of the following is the 
        site of protein synthesis?</b>]]></description>
        <choices>
          <choice name="A"><![CDATA[Cell wall]]></choice>
          <choice name="B" correct_answer="true"><![CDATA[Ribosomes]]></choice>
          <choice name="C"><![CDATA[Vacuoles]]></choice>
          <choice name="D"><![CDATA[DNA polymerase]]></choice>
          <choice name="E"><![CDATA[RNA polymerase]]></choice>
        </choices>
        <explanation><![CDATA[<b>Answer:</b> B, Ribosomes.  Translation, 
        the process that converts mRNA code into protein, takes place in 
        ribosomes.<br /><br /><b>Key Takeaway: </b>Ribosomes are complexes 
        of RNA and protein that are located in cell nuclei.  Ribosomes 
        catalyze both the conversion of the mRNA code into amino acids as 
        well as the assembly of the individual amino acids into a peptide 
        change that becomes a protein.]]></explanation>
      </question>
      <question type="0" number="2" title="Biology #1">
        <images />
        <category><![CDATA[Biology]]></category>
        <description><![CDATA[<b>The site of cellular respiration 
        is:</b>]]></description>
        <choices>
          <choice name="A"><![CDATA[DNA polymerase]]></choice>
          <choice name="B"><![CDATA[Ribosomes]]></choice>
          <choice name="C" correct_answer="true"><![CDATA[Mitochondria]]></choice>
          <choice name="D"><![CDATA[RNA polymerase]]></choice>
          <choice name="E"><![CDATA[Vacuoles]]></choice>
        </choices>
        <explanation><![CDATA[<b>Answer:</b> C, Mitochondria.  The 
        mitochondrion (plural mitochondria) is known as the “powerhouse” 
        of the cell for its role in energy production.<br /><br />
        <b>Key Takeaway: </b>The mitochondrion is a membrane-bound 
        organelle found in most eukaryotic cells.  The dominant role 
        of the mitochondrion is the production of ATP through cellular 
        respiration, which is dependent on the presence of oxygen.  
        All forms of cellular respiration, glycolysis, Krebs’ cycle, 
        and oxidative phosphorylation, take place within the 
        mitochondria.]]></explanation>
      </question>
    </questions>
  </app>
</root>

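How do I maintain the CDATA tags in the merged output? How do I keep <b>, <br />, “ and ” in the merged output instead of getting &lt;b&gt;? Sorry for the really noob question, but I would greatly appreciate your help.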

CDATA is meant for data that the XML parser should not interpret as markup. I think the best you can do in this case is capture the text, like so:

>>> import xml.etree.ElementTree as et
>>> element = et.fromstring('''<explanation><![CDATA[<b>Answer:</b> B, Ribosomes.  Translation, 
        the process that converts mRNA code into protein, takes place in 
        ribosomes.<br /><br /><b>Key Takeaway: </b>Ribosomes are complexes 
        of RNA and protein that are located in cell nuclei.  Ribosomes 
        catalyze both the conversion of the mRNA code into amino acids as 
        well as the assembly of the individual amino acids into a peptide 
        change that becomes a protein.]]></explanation>''')
>>> element.text
'<b>Answer:</b> B, Ribosomes.  Translation, \n        the process that converts mRNA code into protein, takes place in \n        ribosomes.<br /><br /><b>Key Takeaway: </b>Ribosomes are complexes \n        of RNA and protein that are located in cell nuclei.  Ribosomes \n        catalyze both the conversion of the mRNA code into amino acids as \n        well as the assembly of the individual amino acids into a peptide \n        change that becomes a protein.'

You can then unescape the text as suggested by @praveen.
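
To see where the escaping comes from, here is a small round trip on Python 3. It is only a sketch: the one-line sample element is made up for illustration, and `html.unescape` stands in for the Python 2 `HTMLParser` call. The parser hands back the CDATA content as an ordinary string, and `tostring` escapes `<`, `>` and `&` again when it writes that string out.

```python
import html
from xml.etree import ElementTree

# Hypothetical one-line sample; the real files contain longer CDATA blocks.
source = "<description><![CDATA[<b>Ribosomes</b>]]></description>"

element = ElementTree.fromstring(source)

# The CDATA wrapper is gone after parsing; only the character data remains.
parsed_text = element.text
print(parsed_text)   # <b>Ribosomes</b>

# Serializing escapes the markup characters inside the text node.
serialized = ElementTree.tostring(element, encoding="unicode")
print(serialized)    # <description>&lt;b&gt;Ribosomes&lt;/b&gt;</description>

# Unescaping the serialized string brings back <b>, but as real markup,
# not as a CDATA section.
unescaped = html.unescape(serialized)
print(unescaped)     # <description><b>Ribosomes</b></description>
```

Note that `element.text` is already unescaped, so unescaping only matters once the tree has been serialized again.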

Use the HTMLParser Python library, but note that this will not create those CDATA sections.

text = """
<root>
  <categories>
    <category name="Biology" />
  </categories>
  <app>
    <mainHeader>AP Biology</mainHeader>
    <questions>
      <question number="1" title="Biology #1" type="0">
        <images />
        <category>Biology</category>
        <description>&lt;b&gt;Which of the following is the site 
        of protein synthesis?&lt;/b&gt;</description>
        <choices>
          <choice name="A">Cell wall</choice>
          <choice correct_answer="true" name="B">Ribosomes</choice>
          <choice name="C">Vacuoles</choice>
          <choice name="D">DNA polymerase</choice>
          <choice name="E">RNA polymerase</choice>
        </choices>
        <explanation>&lt;b&gt;Answer:&lt;/b&gt; B, Ribosomes.  
        Translation, the process that converts mRNA code into protein, 
        takes place in ribosomes.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;
        Key Takeaway: &lt;/b&gt;Ribosomes are complexes of RNA and protein 
        that are located in cell nuclei.  Ribosomes catalyze both the 
        conversion of the mRNA code into amino acids as well as the assembly 
        of the individual amino acids into a peptide change that becomes 
        a protein.</explanation>
      </question>
      <question number="1" title="Biology #1" type="0">
        <images>
        </images>
        <category>Biology</category>
        <description>&lt;b&gt;The site of cellular respiration is:&lt;/b&gt;
        </description>
        <choices>
          <choice name="A">DNA polymerase</choice>
          <choice name="B">Ribosomes</choice>
          <choice correct_answer="true" name="C">Mitochondria</choice>
          <choice name="D">RNA polymerase</choice>
          <choice name="E">Vacuoles</choice>
        </choices>
        <explanation>&lt;b&gt;Answer:&lt;/b&gt; C, Mitochondria.  The 
        mitochondrion (plural mitochondria) is known as the &#8220;
        powerhouse&#8221; of the cell for its role in energy production.
        &lt;br /&gt;&lt;br /&gt;&lt;b&gt;Key Takeaway: &lt;/b&gt;The 
        mitochondrion is a membrane-bound organelle found in most 
        eukaryotic cells.  The dominant role of the mitochondrion is the 
        production of ATP through cellular respiration, which is dependent 
        on the presence of oxygen.  All forms of cellular respiration, 
        glycolysis, Krebs&#8217; cycle, and oxidative phosphorylation, 
        take place within the mitochondria.</explanation>
      </question>
    </questions>
  </app>
</root>
"""

# Python 2 only; on Python 3 the equivalent call is html.unescape(text)
import HTMLParser
html_parser = HTMLParser.HTMLParser()
unescaped = html_parser.unescape(text)

print unescaped

Output:

<root>
  <categories>
    <category name="Biology" />
  </categories>
  <app>
    <mainHeader>AP Biology</mainHeader>
    <questions>
      <question number="1" title="Biology #1" type="0">
        <images />
        <category>Biology</category>
        <description><b>Which of the following is the site 
        of protein synthesis?</b></description>
        <choices>
          <choice name="A">Cell wall</choice>
          <choice correct_answer="true" name="B">Ribosomes</choice>
          <choice name="C">Vacuoles</choice>
          <choice name="D">DNA polymerase</choice>
          <choice name="E">RNA polymerase</choice>
        </choices>
        <explanation><b>Answer:</b> B, Ribosomes.  
        Translation, the process that converts mRNA code into protein, 
        takes place in ribosomes.<br /><br /><b>
        Key Takeaway: </b>Ribosomes are complexes of RNA and protein 
        that are located in cell nuclei.  Ribosomes catalyze both the 
        conversion of the mRNA code into amino acids as well as the assembly 
        of the individual amino acids into a peptide change that becomes 
        a protein.</explanation>
      </question>
      <question number="1" title="Biology #1" type="0">
        <images>
        </images>
        <category>Biology</category>
        <description><b>The site of cellular respiration is:</b>
        </description>
        <choices>
          <choice name="A">DNA polymerase</choice>
          <choice name="B">Ribosomes</choice>
          <choice correct_answer="true" name="C">Mitochondria</choice>
          <choice name="D">RNA polymerase</choice>
          <choice name="E">Vacuoles</choice>
        </choices>
        <explanation><b>Answer:</b> C, Mitochondria.  The 
        mitochondrion (plural mitochondria) is known as the “
        powerhouse” of the cell for its role in energy production.
        <br /><br /><b>Key Takeaway: </b>The 
        mitochondrion is a membrane-bound organelle found in most 
        eukaryotic cells.  The dominant role of the mitochondrion is the 
        production of ATP through cellular respiration, which is dependent 
        on the presence of oxygen.  All forms of cellular respiration, 
        glycolysis, Krebs’ cycle, and oxidative phosphorylation, 
        take place within the mitochondria.</explanation>
      </question>
    </questions>
  </app>
</root>
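
Neither approach above puts the CDATA sections back. One possible alternative, not taken from the answers here, is to do the merge with the standard library's `xml.dom.minidom`, which keeps CDATA sections as separate nodes and writes them out again. The sketch below is for Python 3 and assumes every file has a single `<questions>` element, as in A.XML and B.XML. The function name `merge_questions` and the shortened sample documents are made up for illustration.

```python
from xml.dom import minidom


def merge_questions(xml_strings):
    """Append the <question> elements of later documents to the first one."""
    merged = None
    target = None
    for xml_string in xml_strings:
        doc = minidom.parseString(xml_string)
        # Assumption: exactly one <questions> element per document.
        questions = doc.getElementsByTagName("questions")[0]
        if merged is None:
            merged, target = doc, questions
            continue
        for node in questions.childNodes:
            # Skip whitespace text nodes; copy only the <question> elements.
            if node.nodeType == node.ELEMENT_NODE:
                # importNode makes a deep copy owned by the merged document,
                # including any CDATA section nodes below it.
                target.appendChild(merged.importNode(node, True))
    if merged is None:
        raise ValueError("no documents to merge")
    return merged.toxml()


# Shortened stand-ins for A.XML and B.XML.
a_xml = ('<root><app><questions><question number="1">'
         '<description><![CDATA[<b>Q1</b>]]></description>'
         '</question></questions></app></root>')
b_xml = ('<root><app><questions><question number="1">'
         '<description><![CDATA[<b>Q2</b>]]></description>'
         '</question></questions></app></root>')

merged_xml = merge_questions([a_xml, b_xml])
print(merged_xml)
```

This leaves the `number` attributes untouched, so renumbering the merged questions as in the desired output would still be a separate step.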
