Parsing same content twice with lxml.iterparse

I do not get why this works:

content = urllib2.urlopen(url)

context = etree.iterparse(content, tag='{my_ns}my_first_tag')
context = iter(context)
#for event, elem in context:
#     pass

context = etree.iterparse(content, tag='{my_ns}my_second_tag')
for event, elem in context:
     pass

whereas this doesn't work:

content = urllib2.urlopen(url)

context = etree.iterparse(content, tag='{my_ns}my_first_tag')
context = iter(context)
for event, elem in context:
     pass

context = etree.iterparse(content, tag='{my_ns}my_second_tag')
for event, elem in context:
     pass

and gives me this error:

XMLSyntaxError: Extra content at the end of the document, line 1, column 1

Can I not parse the same content twice? It seems strange that it works when I comment out only the loop rather than the whole iterparse call.

Am I forgetting to close something?

Many thanks

urllib2.urlopen gives you a file-like object that you can use to read the contents of the URL you're querying.

My guess is that etree.iterparse returns a lazy iterator that does not read from content until you start iterating over it. That would explain both cases. In your first snippet the first context is never iterated, so content is still unread when the second iterparse starts. In your second snippet the first loop iterates over context and "consumes" the data in content as it goes.

By the time you create the second context, you are passing it the same content object, which is already "empty".
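To illustrate the "consumed" stream, here is a minimal sketch in Python 3. It assumes an in-memory io.BytesIO as a hypothetical stand-in for the response object returned by urlopen, and the sample XML is made up. It only shows that a file-like object has a read position that moves forward, not how lxml reads from it internally.

```python
import io

# Hypothetical stand-in for the file-like object returned by urlopen():
# it keeps a read position that only moves forward as data is read.
content = io.BytesIO(b"<root><a/><b/></root>")

first_pass = content.read()   # reads everything up to the end of the stream
second_pass = content.read()  # the position is already at the end

print(first_pass)
print(second_pass)  # b''
```

A second parser handed the same object starts from that end position, so it sees no document at all.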

Edit: since you asked for ways to reparse, one option is to read all of the data once and then pass it to each iterparse call separately, wrapping it in a fresh StringIO as the file-like object each time. For example:

from StringIO import StringIO

# ...

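# Read the whole response into memory once.
data = content.read()
# Give each iterparse call its own fresh file-like object over the same data.
context = etree.iterparse(StringIO(data), tag='{my_ns}my_first_tag')
# processing...
context = etree.iterparse(StringIO(data), tag='{my_ns}my_second_tag')
# processing...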
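On Python 3 the same idea would use io.BytesIO, since the response body is bytes. The sketch below is a self-contained variant under a few assumptions. It uses the standard library's xml.etree.ElementTree.iterparse instead of lxml so that it runs without third-party packages. That function has no tag= keyword, so the hypothetical helper iter_tag filters by tag name itself. The sample document is invented for illustration.

```python
import io
import xml.etree.ElementTree as ET


def iter_tag(data, tag):
    """Yield elements named `tag` from a fresh parse of the bytes in `data`.

    Hypothetical helper: each call wraps `data` in a new BytesIO, so every
    parse starts reading from the beginning.
    """
    for event, elem in ET.iterparse(io.BytesIO(data), events=("end",)):
        if elem.tag == tag:
            yield elem


# Made-up sample document; in the question this would be content.read().
data = (
    b'<root xmlns="my_ns">'
    b"<my_first_tag>1</my_first_tag>"
    b"<my_second_tag>2</my_second_tag>"
    b"</root>"
)

# Two independent passes over the same bytes.
first = [elem.text for elem in iter_tag(data, "{my_ns}my_first_tag")]
second = [elem.text for elem in iter_tag(data, "{my_ns}my_second_tag")]

print(first)
print(second)
```

With lxml, the body of iter_tag could presumably be replaced by etree.iterparse(io.BytesIO(data), tag=tag), as in the snippet above. If both tags can be handled in one loop, a single pass that checks elem.tag avoids parsing the document twice.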
