
Is there any way to get an iterator from bs4 findAll() like re.finditer()?

I don't want bs4 to parse the whole document, but I also can't use the limit argument because I don't know beforehand how many links I will need to parse. If this were re, I would use re.finditer() in this situation, but I couldn't find a similar function in bs4.

No, BeautifulSoup does not have an iterative/lazy version of find_all().

One thing you can do to avoid processing the whole document is to use a SoupStrainer, which tells BeautifulSoup to keep only the desired elements of a page in the parsed tree.
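As a minimal sketch (the HTML string here is made up for illustration): passing a SoupStrainer via the parse_only argument means the resulting tree contains only the matching tags, so memory stays small even for a large page. Note that the input is still read in full; the strainer limits what is kept, not what is scanned.

```python
from bs4 import BeautifulSoup, SoupStrainer

html = '<html><body><p>text</p>' + ''.join(
    '<a href="/page/{}">link {}</a>'.format(i, i) for i in range(3)
) + '</body></html>'

# parse_only keeps just the matching elements, so the parsed
# tree contains only the <a> tags (the <p> is discarded)
only_links = SoupStrainer('a')
soup = BeautifulSoup(html, 'html.parser', parse_only=only_links)

for a in soup.find_all('a'):
    print(a['href'])
```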

Since you commented that you are dealing with an XML document, you can use ElementTree, whose elements implement .iter(), which returns a lazy iterator (available since Python 3.2):

import xml.etree.ElementTree as ET

# Build a small test document: <root> containing ten <a> elements
doc = ['<root>'] + ['<a href="{}"/>'.format(i) for i in range(10)] + ['</root>']
doc = ET.fromstring(''.join(doc))

print(doc.iter(tag='a'))  # .iter() returns an iterator, not a list
for link in doc.iter(tag='a'):
    print(link)

outputs

# <_elementtree._element_iterator object at 0x000001FFE8B44468>
# <Element 'a' at 0x000001FFD05253B8>
# <Element 'a' at 0x000001FFE8AF62C8>
# <Element 'a' at 0x000001FFE8B32B38>
# <Element 'a' at 0x000001FFE8B32B88>
# <Element 'a' at 0x000001FFE8B41228>
# <Element 'a' at 0x000001FFE8B451D8>
# <Element 'a' at 0x000001FFE8B45228>
# <Element 'a' at 0x000001FFE8B45278>
# <Element 'a' at 0x000001FFE8B452C8>
# <Element 'a' at 0x000001FFE8B45318>
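Note that ET.fromstring still builds the whole tree up front; only the traversal above is lazy. If even that is too expensive, the standard library's ET.iterparse() parses the input incrementally and yields elements as their closing tags are seen. A minimal sketch, using an in-memory io.BytesIO in place of a real file:

```python
import io
import xml.etree.ElementTree as ET

xml_doc = '<root>' + ''.join('<a href="{}"/>'.format(i) for i in range(10)) + '</root>'

# iterparse reads the source incrementally and yields (event, element)
# pairs; an 'end' event fires once an element's closing tag is parsed
for event, elem in ET.iterparse(io.BytesIO(xml_doc.encode()), events=('end',)):
    if elem.tag == 'a':
        print(elem.get('href'))
        elem.clear()  # discard processed elements to keep memory flat
```

For a real file, pass the filename (or an open binary file object) as the first argument instead of the BytesIO wrapper.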
