I'm attempting to parse a very large HTML document that looks something like this:
<div class="reportsubsection n">
<h2> part 1 </h2>
<p> insert text here </p>
<table> crazy table thing here </table>
</div>
<div class="reportsubsection n">
<h2> part 2 </h2>
<p> insert text here </p>
<table> crazy table thing here </table>
</div>
I need to pull out the second div, based on its h2 having the text "part 2". I was able to break out all the divs with:
divTag = soup.find("div", {"id": "reportsubsection"})
but didn't know how to narrow it down from there. From other posts I worked out how to find the specific text "part 2", but I need to be able to output the whole div section it is contained in.
EDIT/UPDATE
OK, sorry, but I'm still a little lost. Here is what I've got now. I feel like this should be so much simpler than I'm making it. Thanks again for all the help.
divTag = soup.find("div", {"id": "reportsubsection"})
for reportsubsection in soup.select('div#reportsubsection #reportsubsection'):
    if not reportsubsection.findAll('h2', text=re.compile('Finding')):
        continue
print divTag
You can always go back up after finding the right h2, or you can test all subsections:
for subsection in soup.select('div#reportsubsection #subsection'):
    if not subsection.find('h2', text=re.compile('part 2')):
        continue
    # do something with this subsection
This uses a CSS selector to locate all subsections.
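For the markup shown in the question, where reportsubsection is a class rather than an id, the same idea might look like the sketch below. The trimmed-down html string and the find_sections helper are assumptions made up for illustration, and the heading text is read with get_text() so the snippet does not depend on the text= keyword.

```python
import re
from bs4 import BeautifulSoup

# Trimmed-down stand-in for the document in the question (assumption).
html = """
<div class="reportsubsection n">
<h2> part 1 </h2>
<p> insert text here </p>
</div>
<div class="reportsubsection n">
<h2> part 2 </h2>
<p> insert text here </p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")


def find_sections(soup, pattern):
    """Return every div.reportsubsection whose h2 text matches pattern.

    Hypothetical helper, not part of BeautifulSoup.
    """
    matches = []
    # The class selector narrows the search to the subsections first.
    for section in soup.select("div.reportsubsection"):
        h2 = section.find("h2")
        if h2 is not None and pattern.search(h2.get_text()):
            matches.append(section)
    return matches


sections = find_sections(soup, re.compile(r"part 2", re.IGNORECASE))
for section in sections:
    print(section)  # the whole matching div, including its children
```

Printing a tag outputs its full markup, so each section here is the complete div rather than just the heading.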
Or, going back up with the .parent attribute:
for header in soup.find_all('h2', text=re.compile('part 2')):
    section = header.parent
The trick is to narrow down your search as early as possible; the second option has to find all h2 elements in the whole document, while the former narrows the search down quicker.
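As a rough, self-contained sketch of the "go back up" route against the question's markup: the html string and the section_for_heading helper below are assumptions for illustration only. It uses find_parent() instead of .parent, so it should still land on the enclosing div if the h2 ever ends up wrapped in another element.

```python
import re
from bs4 import BeautifulSoup

# Trimmed-down stand-in for the document in the question (assumption).
html = """
<div class="reportsubsection n">
<h2> part 1 </h2>
<p> insert text here </p>
</div>
<div class="reportsubsection n">
<h2> part 2 </h2>
<p> insert text here </p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")


def section_for_heading(soup, pattern):
    """Return the div.reportsubsection enclosing the first matching h2.

    Hypothetical helper; returns None when no heading matches.
    """
    # This scans every h2 in the document, then climbs to the enclosing div.
    for header in soup.find_all("h2"):
        if pattern.search(header.get_text()):
            return header.find_parent("div", class_="reportsubsection")
    return None


section = section_for_heading(soup, re.compile(r"part 2", re.IGNORECASE))
print(section)
```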