How to iterate through multiple child nodes of an XML item using Python 2.7

I'm trying to parse imperfectly structured XML data from the USPTO, which comes in the following form:

<parent>
 <child>
  <child-text>text
  <child-text>more text</child-text>
  <child-text>more text</child-text>
  </child-text>
 </child>
</parent>

I'm trying to capture all the text of the child-text nodes. But as you can see, the first child-text tag does not close until after all the remaining tags have closed. The following excerpt is an example:

<claims id="claims">
  <claim id="CLM-00001" num="00001">
    <claim-text>1. An all-solid-state electrochromic device comprising:
    <claim-text>a transparent base material; and</claim-text>
    <claim-text>an electrochromic multilayer-stack structure formed on the transparent base material, the electrochromic multilayer-stack structure comprising:
    <claim-text>a first transparent-conductive film;</claim-text>
    <claim-text>an ion-storage layer formed on the first transparent-conductive film;</claim-text>
    <claim-text>a solid-electrolyte layer formed on the ion-storage layer; and</claim-text>
    <claim-text>an electrochromic layer formed on the solid-electrolyte layer, the electrochromic layer comprising a reflection-controllable electrochromic layer comprising an antimony-based alloy comprising Sb<sub>x</sub>CoLi<sub>y </sub>in which 0.5&#x2266;x&#x2266;10, and 0.1&#x2266;y&#x2266;10.</claim-text>
    </claim-text>
    </claim-text>
  </claim>
<claim id="CLM-00002" num="00002">
<claim-text>2. The all-solid-state electrochromic device according to <claim-ref idref="CLM-00001">claim 1</claim-ref>, wherein 3&#x2266;x&#x2266;5 and 0.1&#x2266;y&#x2266;3.</claim-text>
</claim>
</claims>

My current approach captures only the content of the first tag and does not adequately capture the content of sub-elements (such as those in the example above):

claims = self.xml.claim
for i, claim in enumerate(claims):
    data = {}
    data['text'] = claim.contents_of('claim_text', as_string=True, upper=False)

How can I traverse all the <claim-text> tags and <claim-ref> sub-tags despite the inconsistent structure?
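
One possible approach, sketched here on the assumption that the real files are well-formed like the excerpt (the outer claim-text element simply contains further claim-text elements), is to let a standard XML parser flatten each claim. The SAMPLE string and the claim_texts helper are hypothetical names used for illustration; the only library involved is xml.etree.ElementTree from the standard library.

```python
import xml.etree.ElementTree as ET

# Trimmed-down sample modeled on the excerpt above (not real USPTO data).
SAMPLE = """<claims id="claims">
  <claim id="CLM-00001" num="00001">
    <claim-text>1. A device comprising:
    <claim-text>a base material; and</claim-text>
    <claim-text>a stack comprising:
    <claim-text>a first film;</claim-text>
    </claim-text>
    </claim-text>
  </claim>
  <claim id="CLM-00002" num="00002">
    <claim-text>2. The device according to <claim-ref idref="CLM-00001">claim 1</claim-ref>, wherein x is 3.</claim-text>
  </claim>
</claims>"""


def claim_texts(xml_string):
    """Return {claim id: flattened text} for every <claim> element."""
    root = ET.fromstring(xml_string)
    result = {}
    for claim in root.iter("claim"):
        # itertext() yields the text of the element and of all its
        # descendants in document order, so nested <claim-text> and
        # <claim-ref> content is included.
        pieces = "".join(claim.itertext())
        # Collapse the whitespace runs left over from the indentation.
        result[claim.get("id")] = " ".join(pieces.split())
    return result


texts = claim_texts(SAMPLE)
for claim_id in sorted(texts):
    print("%s: %s" % (claim_id, texts[claim_id]))
```

Because itertext() walks the whole subtree, the depth of the nesting does not matter. If the real data turns out not to be well-formed, ET.fromstring raises a ParseError and a string-based fallback would be needed instead.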

I had a similar issue with an XML document. What I did was:

start = xml_document.find("<claim-text>") + len("<claim-text>")
content = xml_document[start:xml_document.find("</claim-text>")]

This returns the text between the first <claim-text> opening tag and the first </claim-text> closing tag.
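
As a minimal sketch of that slicing step, the snippet below runs it on a hypothetical one-claim string modeled on the excerpt in the question (the sample and the variable names are assumptions, not part of the original answer). It shows that with nested tags the slice still contains an inner opening tag and stops at the first closing tag.

```python
# Hypothetical one-claim sample modeled on the excerpt in the question.
xml_document = (
    "<claim-text>1. A device comprising:"
    "<claim-text>a base material; and</claim-text>"
    "<claim-text>a first film;</claim-text>"
    "</claim-text>"
)

open_tag = "<claim-text>"
close_tag = "</claim-text>"

# Slice from just after the first opening tag up to the first closing tag.
# Note: str.find returns -1 when the tag is missing, which this sketch
# does not guard against.
start = xml_document.find(open_tag) + len(open_tag)
end = xml_document.find(close_tag)
content = xml_document[start:end]

print(content)
# prints: 1. A device comprising:<claim-text>a base material; and
```

The leftover inner <claim-text> tag in the result is why some further cleanup of the sliced content is needed.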

Then remove any extra tags left inside that content, using an if statement:

if content contains [<\d>] then remove them by finding their indexes

On every iteration, remove the already-parsed part of xml_document by index.
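
The steps above can be sketched roughly as follows. This is one reading of the approach, not the answerer's actual code: it slices out one <claim> block at a time, strips the remaining tags with a regular expression (swapped in here for the index-based removal), and then drops the parsed part of the document before the next iteration. The names SAMPLE, TAG, and extract_claims are hypothetical, and the sketch assumes every <claim> opening tag carries attributes, as in the excerpt.

```python
import re

# Matches any opening or closing tag. This regex is a swapped-in
# replacement for removing tags by index; it assumes no attribute
# value contains a ">" character.
TAG = re.compile(r"<[^>]+>")

# Trimmed-down sample modeled on the excerpt in the question.
SAMPLE = """<claims id="claims">
<claim id="CLM-00001" num="00001">
<claim-text>1. A device comprising:
<claim-text>a base material; and</claim-text>
<claim-text>a first film.</claim-text>
</claim-text>
</claim>
<claim id="CLM-00002" num="00002">
<claim-text>2. The device according to <claim-ref idref="CLM-00001">claim 1</claim-ref>, wherein x is 3.</claim-text>
</claim>
</claims>"""


def extract_claims(xml_document):
    """Return the tag-free text of each <claim ...>...</claim> block."""
    open_tag = "<claim "    # trailing space avoids <claims>, <claim-text>
    close_tag = "</claim>"  # does not match </claims> or </claim-text>
    texts = []
    while True:
        start = xml_document.find(open_tag)
        if start == -1:
            break
        end = xml_document.find(close_tag, start)
        if end == -1:
            break
        block = xml_document[start:end]
        # Strip every remaining tag, then collapse whitespace runs.
        texts.append(" ".join(TAG.sub("", block).split()))
        # Remove the parsed part of the document before the next pass.
        xml_document = xml_document[end + len(close_tag):]
    return texts


claims = extract_claims(SAMPLE)
for text in claims:
    print(text)
```

Unlike a real XML parser, this string-based version does not decode character references such as &#x2266;, so those would remain in the output as written.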
