I'm trying to parse imperfectly structured XML data from the USPTO in the following form:
<parent>
<child>
<child-text>text
<child-text>more text</child-text>
<child-text>more text</child-text>
</child-text>
</child>
</parent>
I'm trying to capture all the text of the child-text nodes. But as you can see, the first child-text tag does not close until after all the remaining tags have finished. The following excerpt is an example:
<claims id="claims">
<claim id="CLM-00001" num="00001">
<claim-text>1. An all-solid-state electrochromic device comprising:
<claim-text>a transparent base material; and</claim-text>
<claim-text>an electrochromic multilayer-stack structure formed on the transparent base material, the electrochromic multilayer-stack structure comprising:
<claim-text>a first transparent-conductive film;</claim-text>
<claim-text>an ion-storage layer formed on the first transparent-conductive film;</claim-text>
<claim-text>a solid-electrolyte layer formed on the ion-storage layer; and</claim-text>
<claim-text>an electrochromic layer formed on the solid-electrolyte layer, the electrochromic layer comprising a reflection-controllable electrochromic layer comprising an antimony-based alloy comprising Sb<sub>x</sub>CoLi<sub>y </sub>in which 0.5≦x≦10, and 0.1≦y≦10.</claim-text>
</claim-text>
</claim-text>
</claim>
<claim id="CLM-00002" num="00002">
<claim-text>2. The all-solid-state electrochromic device according to <claim-ref idref="CLM-00001">claim 1</claim-ref>, wherein 3≦x≦5 and 0.1≦y≦3.</claim-text>
</claim>
</claims>
My current approach captures only the content of the first tag and misses the content of sub-elements (such as those in the example above):
claims = self.xml.claim
for i, claim in enumerate(claims):
    data = {}
    data['text'] = claim.contents_of('claim_text', as_string=True, upper=False)
How can I traverse all the <claim-text> tags and <claim-ref> sub-tags notwithstanding the inconsistent structure?
I had a similar issue with an XML document. What I did was slice out the text between the first opening tag and the first closing tag:
xml_document[xml_document.find("<claim-text>") + len("<claim-text>"):xml_document.find("</claim-text>")]
Then I removed any extra tags inside that content with an if statement: if the content contains a match for [<\d>], remove it by finding its index.
On every iteration, remove the already-parsed part of xml_document by index.
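The excerpt in the question looks like well-formed XML: the claim-text elements are nested, not left unclosed. If so, a tree parser may be simpler than slicing strings. What follows is a different technique from the slicing described above, offered only as a minimal sketch. It uses the standard library's xml.etree.ElementTree and its itertext() method. The SAMPLE string is a shortened, hypothetical stand-in for the USPTO data, and claim_texts is an illustrative helper name, not part of any existing parser. It assumes each document is well-formed and parsed on its own.

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical sample with the same nesting as the excerpt in the question.
SAMPLE = """<claims id="claims">
<claim id="CLM-00001" num="00001">
<claim-text>1. A device comprising:
<claim-text>a base material; and</claim-text>
<claim-text>a stack comprising:
<claim-text>a first film; and</claim-text>
<claim-text>an alloy Sb<sub>x</sub>CoLi<sub>y</sub>.</claim-text>
</claim-text>
</claim-text>
</claim>
<claim id="CLM-00002" num="00002">
<claim-text>2. The device according to <claim-ref idref="CLM-00001">claim 1</claim-ref>, wherein x is 3.</claim-text>
</claim>
</claims>"""


def claim_texts(xml_string):
    """Return {claim id: flattened text} for every <claim> element (illustrative helper)."""
    root = ET.fromstring(xml_string)
    result = {}
    for claim in root.iter("claim"):
        # itertext() walks the element and all of its descendants in document
        # order, yielding each piece of text, including text that follows a
        # nested tag such as <claim-ref> or <sub>.
        raw = "".join(claim.itertext())
        # Collapse the newlines and repeated spaces left over from the markup.
        result[claim.get("id")] = " ".join(raw.split())
    return result


claims = claim_texts(SAMPLE)
for claim_id, text in claims.items():
    print(claim_id, text)
```

Because itertext() descends through every level, the nested claim-text, claim-ref and sub elements all contribute their text without any special handling. If the individual sub-claims need to stay separate, iterating over claim.iter("claim-text") instead would be one way to do that.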