繁体   English   中英

使用python etree打印xml的嵌套元素

[英]Print nested element of xml with python etree

我正在尝试构建一个脚本来读取xml文件。 这是我第一次解析xml,我正在将python与xml.etree.ElementTree一起使用。 我要处理的文件部分如下所示:

    <component>
        <section>
                <id root="42CB916B-BB58-44A0-B8D2-89B4B27F04DF" />
                <code code="34089-3" codeSystem="2.16.840.1.113883.6.1" codeSystemName="LOINC" displayName="DESCRIPTION SECTION" />
                <title mediaType="text/x-hl7-title+xml">DESCRIPTION</title>
                <text>
                        <paragraph>Renese<sup>®</sup> is designated generically as polythiazide, and chemically as 2<content styleCode="italics">H</content>-1,2,4-Benzothiadiazine-7-sulfonamide, 6-chloro-3,4-dihydro-2-methyl-3-[[(2,2,2-trifluoroethyl)thio]methyl]-, 1,1-dioxide. It is a white crystalline substance, insoluble in water but readily soluble in alkaline solution.</paragraph>
                        <paragraph>Inert Ingredients: dibasic calcium phosphate; lactose; magnesium stearate; polyethylene glycol; sodium lauryl sulfate; starch; vanillin. The 2 mg tablets also contain: Yellow 6; Yellow 10.</paragraph>
                </text>
                <effectiveTime value="20051214" />
        </section>
</component>    
<component>
        <section>
               <id root="CF5D392D-F637-417C-810A-7F0B3773264F" />
               <code code="42229-5" codeSystem="2.16.840.1.113883.6.1" codeSystemName="LOINC" displayName="SPL UNCLASSIFIED SECTION" />
               <title mediaType="text/x-hl7-title+xml">ACTION</title>
               <text>
                        <paragraph>The mechanism of action results in an interference with the renal tubular mechanism of electrolyte reabsorption. At maximal therapeutic dosage all thiazides are approximately equal in their diuretic potency. The mechanism whereby thiazides function in the control of hypertension is unknown.</paragraph>
                </text>
                <effectiveTime value="20051214" />
                </section>
</component>

可以从以下位置下载完整文件:

https://dailymed.nlm.nih.gov/dailymed/getFile.cfm?setid=abd6ecf0-dc8e-41de-89f2-1e36ed9d6535&type=zip&name=Renese

这是我的代码:

import xml.etree.ElementTree as ElementTree
import re

with open("ABD6ECF0-DC8E-41DE-89F2-1E36ED9D6535.xml") as f:
    xmlstring = f.read()

# Remove the default namespace definition (xmlns="http://some/namespace")
xmlstring = re.sub(r'\sxmlns="[^"]+"', '', xmlstring, count=1)

tree = ElementTree.fromstring(xmlstring)

for title in tree.iter('title'):
     print(title.text)

到目前为止,我可以打印标题,但我也想打印标签中捕获的相应文本。

我已经试过了:

for title in tree.iter('title'):
     print(title.text)
     for paragraph in title.iter('paragraph'):
         print(paragraph.text)

但是我没有来自段落的输出

for title in tree.iter('title'):
         print(title.text)
         for paragraph in tree.iter('paragraph'):
             print(paragraph.text)

我打印了段落的文本,但是(显然)对于xml结构中找到的每个标题,它们都打印在一起了。

我想找到一种方法来1)识别标题; 2)打印相应的段落。 我该怎么做?

如果您愿意使用lxml,那么以下是使用XPath的解决方案:

import re
from lxml.etree import fromstring


with open("ABD6ECF0-DC8E-41DE-89F2-1E36ED9D6535.xml") as f:
    xmlstring = f.read()

xmlstring = re.sub(r'\sxmlns="[^"]+"', '', xmlstring, count=1)

doc = fromstring(xmlstring.encode())  # lxml only accepts bytes input, hence we encode

for title in doc.xpath('//title'):  # for all title nodes
     title_text = title.xpath('./text()')  # get text value of the node

     # get all text values of the paragraph nodes that appear lower (//paragraph)
     # in the hierarchy than the parent (..) of <title>
     paragraphs_for_title = title.xpath('..//paragraph/text()')

     print(title_text[0] if title_text else '') 
     for paragraph in paragraphs_for_title: 
         print(paragraph)

暂无
暂无

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM