
Is there any way to get an iterator from bs4 findAll() like re.finditer()?

I don't want bs4 to parse the whole document, but I also can't use the limit argument because I don't know beforehand how many links I will need to parse. If this were re, I would use re.finditer() in this situation, but I couldn't find a similar function in bs4.

No, BeautifulSoup does not have an iterative/lazy version of find_all().

One thing you can do to avoid processing the whole document is to use a SoupStrainer, which tells BeautifulSoup to keep only the desired elements of a page in the parsed tree.
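As a minimal sketch (the HTML string here is made up for illustration): passing a SoupStrainer via the parse_only argument means the resulting tree contains only the matching tags, so memory stays small even for a large page. Note that the input is still read in full; the strainer limits what is kept, not what is scanned.

```python
from bs4 import BeautifulSoup, SoupStrainer

html = '<html><body><p>text</p>' + ''.join(
    '<a href="/page/{}">link {}</a>'.format(i, i) for i in range(3)
) + '</body></html>'

# parse_only keeps just the matching elements, so the parsed
# tree contains only the <a> tags (the <p> is discarded)
only_links = SoupStrainer('a')
soup = BeautifulSoup(html, 'html.parser', parse_only=only_links)

for a in soup.find_all('a'):
    print(a['href'])
```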

Since you commented that you are dealing with an XML document, you can use ElementTree, whose elements implement .iter(), which returns a lazy iterator (available since Python 3.2):

import xml.etree.ElementTree as ET

# Build a small test document: <root> containing ten <a> elements
doc = ['<root>'] + ['<a href="{}"/>'.format(i) for i in range(10)] + ['</root>']
doc = ET.fromstring(''.join(doc))

print(doc.iter(tag='a'))  # .iter() returns an iterator, not a list
for link in doc.iter(tag='a'):
    print(link)

outputs

# <_elementtree._element_iterator object at 0x000001FFE8B44468>
# <Element 'a' at 0x000001FFD05253B8>
# <Element 'a' at 0x000001FFE8AF62C8>
# <Element 'a' at 0x000001FFE8B32B38>
# <Element 'a' at 0x000001FFE8B32B88>
# <Element 'a' at 0x000001FFE8B41228>
# <Element 'a' at 0x000001FFE8B451D8>
# <Element 'a' at 0x000001FFE8B45228>
# <Element 'a' at 0x000001FFE8B45278>
# <Element 'a' at 0x000001FFE8B452C8>
# <Element 'a' at 0x000001FFE8B45318>
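Note that ET.fromstring still builds the whole tree up front; only the traversal above is lazy. If even that is too expensive, the standard library's ET.iterparse() parses the input incrementally and yields elements as their closing tags are seen. A minimal sketch, using an in-memory io.BytesIO in place of a real file:

```python
import io
import xml.etree.ElementTree as ET

xml_doc = '<root>' + ''.join('<a href="{}"/>'.format(i) for i in range(10)) + '</root>'

# iterparse reads the source incrementally and yields (event, element)
# pairs; an 'end' event fires once an element's closing tag is parsed
for event, elem in ET.iterparse(io.BytesIO(xml_doc.encode()), events=('end',)):
    if elem.tag == 'a':
        print(elem.get('href'))
        elem.clear()  # discard processed elements to keep memory flat
```

For a real file, pass the filename (or an open binary file object) as the first argument instead of the BytesIO wrapper.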
