Multiple tag names in lxml's iterparse?

Is there a way to get multiple tag names from lxml's lxml.etree.iterparse? I have a file-like object with an expensive read operation and many tags, so getting all tags or doing two passes is suboptimal.

Edit: It would be something like Beautiful Soup's find(['tag-1', 'tag-2']), except as an argument to iterparse. Imagine parsing an HTML page for both <td> and <div> tags.

I know I'm late to the game, but maybe someone else needs help with the same issue. iterparse accepts a sequence of names for its tag argument, so this code will generate events for both Tag1 and Tag2 elements:

etree.iterparse(io.BytesIO(xml), events=('end',), tag=('Tag1', 'Tag2'))
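To make the one-liner above concrete, here is a minimal runnable sketch. The sample document and the names `xml` and `found` are made up for illustration; only the `tag=` tuple is the point.

```python
import io
from lxml import etree

# Hypothetical sample document: only Tag1 and Tag2 are of interest.
xml = b"<root><Tag1>a</Tag1><Other>skip</Other><Tag2>b</Tag2><Tag1>c</Tag1></root>"

found = []
# Passing a tuple to `tag` restricts the generated events to those element names.
for event, elem in etree.iterparse(io.BytesIO(xml), events=("end",), tag=("Tag1", "Tag2")):
    found.append((elem.tag, elem.text))

print(found)  # [('Tag1', 'a'), ('Tag2', 'b'), ('Tag1', 'c')]
```

The parser still reads and builds the `Other` element; the filter only suppresses its event. For namespaced documents, the names would need to be in lxml's `{namespace-uri}localname` form.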

I'm not 100% sure what you mean here by "getting all tags", but perhaps this is what you're looking for:

from lxml.etree import iterparse

for event, elem in iterparse(file_like_object):  # default is events=('end',)
    if elem.tag in ('td', 'div'):
        # reached the end of an interesting tag
        print('found:', elem.tag)
        # possibly quit early to prevent further parsing
        if exit_condition:
            break

iterparse generates events on the fly during parsing, so you're only reading as much data as is required. However, there's no way you can skip reading elements during parsing, as you wouldn't know how far to skip. In the above, we just ignore tags that we're not interested in.
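The on-demand reading can be checked with a small experiment. This is only a sketch: `CountingReader` is a hypothetical stand-in for the expensive file-like object, and the sample data is invented. The parser pulls input in chunks, so stopping early will not land exactly on the element boundary, but it should leave most of a large input unread.

```python
import io
from lxml import etree

class CountingReader(io.BytesIO):
    """Hypothetical stand-in for an expensive source; counts bytes handed out."""

    def __init__(self, data):
        super().__init__(data)
        self.bytes_read = 0

    def read(self, size=-1):
        chunk = super().read(size)
        self.bytes_read += len(chunk)
        return chunk

# One interesting <td> near the start, followed by a lot of filler.
data = b"<root><td>first</td>" + b"<row>filler</row>" * 100000 + b"</root>"
source = CountingReader(data)

first_tag = None
for event, elem in etree.iterparse(source, events=("end",)):
    if elem.tag in ("td", "div"):
        first_tag = elem.tag
        break  # stop as soon as we have what we need

print(first_tag, source.bytes_read, "of", len(data), "bytes read")
```

Breaking out of the loop stops further calls to `read()`, which is what makes early exit worthwhile when reads are costly.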

As you may already know, XML parsers are generally the wrong tool for HTML. Edit: it turns out that lxml does support HTML parsing (iterparse itself takes an html=True option), but you should check the docs to see to what extent.
