Removing Processing Instructions with Python lxml

I am using the Python lxml library to transform XML files to a new schema, but I've run into problems handling processing instructions in the XML body.

The processing instruction elements are scattered throughout the XML, as in the following example (they all begin with "oasys" and end with a unique code):

string = "<text><?oasys _dc21-?>Text <i>contents</i></text>"

I can't locate them with the element's findall() method, although getchildren() returns them:

import lxml.etree

tree = lxml.etree.fromstring(string)
print(tree.findall(".//"))
# [<Element i at 0x747c>]
print(tree.getchildren())
# [<?oasys _dc21-?>, <Element i at 0x747c>]
print(tree.getchildren()[0].tag)
# <built-in function ProcessingInstruction>
print(tree.getchildren()[0].tail)
# Text

Is there an alternative to using getchildren() to parse and remove processing instructions, especially considering that they're nested at various levels throughout the XML?

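You can use the processing-instruction() XPath node test to find the processing instructions, then remove them with etree.strip_tags(). Unlike removing the node outright, strip_tags() merges the instruction's tail text into the parent, so the text that follows each instruction is preserved.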

Example:

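from lxml import etree

string = "<text><?oasys _dc21-?>Text <i>contents</i></text>"
tree = etree.fromstring(string)

# Select every processing instruction in the document.
pis = tree.xpath("//processing-instruction()")
for pi in pis:
    # strip_tags() drops the instruction but keeps its tail text.
    etree.strip_tags(pi.getparent(), pi.tag)

# encoding="unicode" returns str rather than bytes under Python 3.
print(etree.tostring(tree, encoding="unicode"))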

Output:

<text>Text <i>contents</i></text>
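
Since the question mentions instructions nested at various levels, a shorter variant may be worth trying: strip_tags() accepts etree.ProcessingInstruction as a tag name and walks the whole subtree of the element it is given, so a single call on the root should cover every level. The sketch below assumes Python 3; the second instruction (_dc22-) and the trailing " end" text are made up for illustration and are not from the original data.

```python
from lxml import etree

# Hypothetical input: the nested <?oasys _dc22-?> and " end" are illustrative.
xml = "<text><?oasys _dc21-?>Text <i><?oasys _dc22-?>contents</i> end</text>"
root = etree.fromstring(xml)

# One call handles every processing instruction below root; the tail text
# of each instruction is merged into the surrounding text, not dropped.
etree.strip_tags(root, etree.ProcessingInstruction)

result = etree.tostring(root, encoding="unicode")
print(result)
```

This avoids the explicit loop and the getparent() lookup, at the cost of removing all processing instructions rather than a selected subset.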
