简体   繁体   中英

python lxml using iterparse to edit and output xml

I've been messing around with the lxml library for a little while and maybe I'm not understanding it correctly or I'm missing something but I can't seem to figure out how to edit the file after I catch a certain xpath and then be able to write that back out into xml while I'm parsing element by element.

Say we have this xml as an example:

<xml>
   <items>
      <pie>cherry</pie>
      <pie>apple</pie>
      <pie>chocolate</pie>
  </items>
</xml>

What I would like to do while parsing is when I hit that xpath of "/xml/items/pie" is to add an element before pie, so it will turn out like this:

<xml>
   <items>
      <item id="1"><pie>cherry</pie></item>
      <item id="2"><pie>apple</pie></item>
      <item id="3"><pie>chocolate</pie></item>
  </items>
</xml>

That output would need to be done by writing to a file line by line as I hit each tag and edit the xml at certain xpaths. I mean I could have it print the starting tag, the text, the attribute if it exists, and then the ending tag by hard coding certain parts, but that would be very messy and it be nice if there was a way to avoid that if possible.

Here's my guess code at this:

from lxml import etree

path=[]
count=0

context=etree.iterparse(file,events=('start','end'))
for event, element in context:
    if event=='start':
       path.append(element.tag)
       if /'+'/'.join(path)=='/xml/items/pie':
          itemnode=etree.Element('item',id=str(count))
          itemnode.text=""
          element.addprevious(itemnode)#Not the right way to do it of course
          #write/print out xml here.
    else:
        element.clear()
        path.pop()

Edit: Also, I need to run through fairly big files, so I have to use iterparse.

There is a more clean way to make modifications you need:

  • iterate over pie elements
  • make an item element
  • use replace() to replace a pie element with item

replace(self, old_element, new_element)

Replaces a subelement with the element passed as second argument.


from lxml import etree
from lxml.etree import XMLParser, Element

data = """<xml>
   <items>
      <pie>cherry</pie>
      <pie>apple</pie>
      <pie>chocolate</pie>
  </items>
</xml>"""


tree = etree.fromstring(data, parser=XMLParser())
items = tree.find('.//items')
for index, pie in enumerate(items.xpath('.//pie'), start=1):
    item = Element('item', {'id': str(index)})
    items.replace(pie, item)
    item.append(pie)

print etree.tostring(tree, pretty_print=True)

prints:

<xml>
   <items>
      <item id="1"><pie>cherry</pie></item>
      <item id="2"><pie>apple</pie></item>
      <item id="3"><pie>chocolate</pie></item>
   </items>
</xml>

Here's a solution using iterparse() . The idea is to catch all tag "start" events, remember the parent ( items ) tag, then for every pie tag create an item tag and put the pie into it:

from StringIO import StringIO
from lxml import etree
from lxml.etree import Element

data = """<xml>
   <items>
      <pie>cherry</pie>
      <pie>apple</pie>
      <pie>chocolate</pie>
  </items>
</xml>"""

stream = StringIO(data)
context = etree.iterparse(stream, events=("start", ))

for action, elem in context:
    if elem.tag == 'items':
        items = elem
        index = 1
    elif elem.tag == 'pie':
        item = Element('item', {'id': str(index)})
        items.replace(elem, item)
        item.append(elem)
        index += 1

print etree.tostring(context.root)

prints:

<xml>
   <items>
      <item id="1"><pie>cherry</pie></item>
      <item id="2"><pie>apple</pie></item>
      <item id="3"><pie>chocolate</pie></item>
   </items>
</xml>

I would suggest you to use an XSLT template, as it seems to match better for this task. Initially XSLT is a little bit tricky until you get used to it, if all you want is to generate some output from an XML, then XSLT is a great tool.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM