
Python/Beautiful Soup find particular heading output full div

I'm attempting to parse a very extensive HTML document that looks something like:

<div class="reportsubsection n">
   <h2> part 1 </h2>
   <p> insert text here </p>
   <table> crazy table thing here </table>
</div>
<div class="reportsubsection n">
   <h2> part 2 </h2>
   <p> insert text here </p>
   <table> crazy table thing here </table>
</div>

I need to parse out the second div based on its h2 having the text "Part 2". I was able to break out all divs with:

divTag = soup.find("div", {"id": "reportsubsection"})

but didn't know how to narrow it down from there. In other posts I found, I was able to find the specific text "part 2", but I need to be able to output the whole div section it is contained in.

EDIT/UPDATE

OK, sorry, but I'm still a little lost. Here is what I've got now. I feel like this should be so much simpler than I'm making it. Thanks again for all the help.

divTag = soup.find("div", {"id": "reportsubsection"})
for reportsubsection in soup.select('div#reportsubsection #reportsubsection'):
    if not reportsubsection.findAll('h2', text=re.compile('Finding')):
        continue
print divTag

You can always go back up after finding the right h2, or you can test all subsections:

import re

for subsection in soup.select('div#reportsubsection #subsection'):
    # skip any subsection whose h2 does not contain the text "part 2"
    if not subsection.find('h2', text=re.compile('part 2')):
        continue
    # do something with this subsection

This uses a CSS selector to locate all subsections.
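For the sample markup in the question, where the divs carry a class rather than an id, the same idea might look roughly like the sketch below. It assumes the class name `reportsubsection` from the sample HTML; the helper name `find_sections` is made up for illustration, and `string=` is the newer spelling of the `text=` argument in recent Beautiful Soup 4 releases.

```python
import re
from bs4 import BeautifulSoup

# Trimmed-down copy of the sample HTML from the question.
html = """
<div class="reportsubsection n">
   <h2> part 1 </h2>
   <p> first text </p>
</div>
<div class="reportsubsection n">
   <h2> part 2 </h2>
   <p> second text </p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")


def find_sections(soup, pattern):
    """Return every div.reportsubsection that contains an h2 matching pattern.

    find_sections is a hypothetical helper, not part of Beautiful Soup.
    """
    matches = []
    # The class selector matches even though the attribute holds two classes.
    for div in soup.select("div.reportsubsection"):
        if div.find("h2", string=pattern) is not None:
            matches.append(div)
    return matches


# IGNORECASE so that both "part 2" and "Part 2" are accepted.
sections = find_sections(soup, re.compile(r"part 2", re.IGNORECASE))
for section in sections:
    print(section)  # prints the whole div, including its h2, p and other children
```

Printing the tag (or calling `str(section)`) gives the full div markup, which is what the question asks for.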

Or, going back up with the .parent attribute:

for header in soup.find_all('h2', text=re.compile('part 2')):
    section = header.parent

The trick is to narrow down your search as early as possible; the second option has to find all h2 elements in the whole document, while the former narrows the search down quicker.
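A possible variation on the "go back up" approach is sketched here with `find_parent()` instead of `.parent`, so it still works if the h2 is wrapped in another element. The `<header>` wrapper is an assumption added for the demonstration and is not in the original markup; `section_for_heading` is a hypothetical helper name.

```python
import re
from bs4 import BeautifulSoup

# Sample HTML; the <header> wrapper is an assumption added to show
# why .parent alone may not land on the div you want.
html = """
<div class="reportsubsection n">
   <header><h2> part 1 </h2></header>
   <p> first text </p>
</div>
<div class="reportsubsection n">
   <header><h2> part 2 </h2></header>
   <p> second text </p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")


def section_for_heading(soup, pattern):
    """Return the enclosing div.reportsubsection of the first matching h2, or None.

    section_for_heading is a hypothetical helper, not part of Beautiful Soup.
    """
    header = soup.find("h2", string=pattern)
    if header is None:
        return None
    # Walk up the tree until a div with the reportsubsection class is found.
    return header.find_parent("div", class_="reportsubsection")


section = section_for_heading(soup, re.compile(r"part 2", re.IGNORECASE))
if section is not None:
    print(section)  # the full div that contains the matching heading
```

With the flat markup from the question, `header.parent` and `header.find_parent("div", class_="reportsubsection")` should return the same element; the difference only matters when extra wrappers sit between the h2 and the div.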
