Removing Processing Instructions with Python lxml

Question

I am using the Python lxml library to transform XML files to a new schema, but I've run into problems handling the processing instructions in the XML body.

The processing instruction elements are scattered throughout the XML, as in the following example (they all begin with "oasys" and end with a unique code):

string = "<text><?oasys _dc21-?>Text <i>contents</i></text>"

I can't locate them with the element's findall() method, although getchildren() returns them:

import lxml.etree

tree = lxml.etree.fromstring(string)
print(tree.findall(".//"))
>>>> [<Element i at 0x747c>]
print(tree.getchildren())
>>>> [<?oasys _dc21-?>, <Element i at 0x747x>]
print(tree.getchildren()[0].tag)
>>>> <built-in function ProcessingInstruction>
print(tree.getchildren()[0].tail)
>>>> Text

Is there an alternative to using getchildren() to parse and remove processing instructions, especially considering that they're nested at various levels throughout the XML?

Answer 1

You can use the processing-instruction() XPath node test to find the processing instructions, then remove them with etree.strip_tags(), which deletes the matching nodes but keeps their tail text.

Example:

from lxml import etree

string = "<text><?oasys _dc21-?>Text <i>contents</i></text>"
tree = etree.fromstring(string)

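# Find every processing instruction in the document, at any depth.
pis = tree.xpath("//processing-instruction()")
for pi in pis:
    parent = pi.getparent()
    # strip_tags() removes all matching nodes below the parent at once, so a
    # nested PI may already be detached (no parent) when the loop reaches it.
    if parent is not None:
        etree.strip_tags(parent, pi.tag)

# encoding="unicode" returns text rather than bytes, so the output prints cleanly.
print(etree.tostring(tree, encoding="unicode"))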

Output:

<text>Text <i>contents</i></text>
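
Since pi.tag is the etree.ProcessingInstruction factory itself, the same removal can probably be written as a single strip_tags() call on the root. The sketch below shows that variant, plus a second helper for the case where only the "oasys" instructions should go and other processing instructions must stay. The function names strip_all_pis and strip_pis_by_target, and the extra <?keep me?> instruction in the sample, are made up for illustration and are not part of lxml or of the original question.

```python
from lxml import etree


def strip_all_pis(root):
    """Remove every processing instruction below root, keeping tail text."""
    # strip_tags() accepts the ProcessingInstruction factory as a "tag name".
    etree.strip_tags(root, etree.ProcessingInstruction)
    return root


def strip_pis_by_target(root, target):
    """Remove only the processing instructions whose target equals `target`."""
    # Collect the nodes first so the tree is not modified while iterating.
    for pi in list(root.iter(etree.ProcessingInstruction)):
        if pi.target != target:
            continue
        parent = pi.getparent()
        if parent is None:
            continue
        # remove() drops the node's tail text as well, so move the tail onto
        # the previous sibling (or the parent's text) before removing.
        if pi.tail:
            previous = pi.getprevious()
            if previous is not None:
                previous.tail = (previous.tail or "") + pi.tail
            else:
                parent.text = (parent.text or "") + pi.tail
        parent.remove(pi)
    return root


# Sample with nested instructions; <?keep me?> is a hypothetical PI to preserve.
xml = "<text><?oasys _dc21-?>Text <i>contents<?keep me?><?oasys _dc22-?> here</i></text>"

all_gone = strip_all_pis(etree.fromstring(xml))
oasys_gone = strip_pis_by_target(etree.fromstring(xml), "oasys")

print(etree.tostring(all_gone, encoding="unicode"))
print(etree.tostring(oasys_gone, encoding="unicode"))
```

The first line printed should contain no processing instructions at all, while the second should still contain <?keep me?>. The single-call form is shorter, but it removes every processing instruction regardless of target, so the filtered helper is the safer choice if the files carry other instructions that the new schema still needs.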

solution1
7 ACCEPTED 2015-07-20 18:46:00
