How do you use BeautifulSoup to select a tag depending on its children and siblings?

I am trying to extract quotes from the 2012 Obama-Romney presidential debate. The problem is that the site is not well organized, so the structure looks like this:

<span class="displaytext">
    <p>
        <i>OBAMA</i>Obama's first quotes
    </p>
    <p>More quotes from Obama</p>
    <p>Some more Obama quotes</p>

    <p>
        <i>Moderator</i>Moderator's quotes
    </p>
    <p>Some more quotes</p>

    <p>
        <i>ROMNEY</i>Romney's quotes
    </p>
    <p>More quotes from Romney</p>
    <p>Some more Romney quotes</p>
</span>

Is there a way to select a <p> whose first child is an <i> with the text OBAMA, along with all of its <p> siblings, up to the next <p> whose first child is an <i> that does not have the text OBAMA?

Here is what I have tried so far, but it only grabs the first <p> and ignores its siblings:

input = '''<span class="displaytext">
        <p>
            <i>OBAMA</i>Obama's first quotes
        </p>
        <p>More quotes from Obama</p>
        <p>Some more Obama quotes</p>

       <p>
           <i>Moderator</i>Moderator's quotes
       </p>
       <p>Some more quotes</p>

       <p>
           <i>ROMNEY</i>Romney's quotes
       </p>
       <p>More quotes from Romney</p>
       <p>Some more Romney quotes</p>
       </span>'''

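from bs4 import BeautifulSoup

# name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(input, "html.parser")
debate_text = soup.find("span", {"class": "displaytext"})
# string= is the current name of the older text= argument
president_quotes = debate_text.find_all("i", string="OBAMA")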

for i in president_quotes:
    siblings = i.next_siblings
    for sibling in siblings:
        print(sibling)

This prints only Obama's first quotes.

I think a finite-state-machine-style solution will work here, like this:

soup = BeautifulSoup(input, 'lxml')
debate_text = soup.find("span", { "class" : "displaytext" })
obama_is_on = False
obama_tags = []
for p in debate_text("p"):
    if p.i:
        # assuming <i> is used only to indicate the speaker:
        # switch the state on for OBAMA and off for anyone else
        obama_is_on = p.i.get_text(strip=True) == 'OBAMA'
    if obama_is_on:
        obama_tags.append(p)
print(obama_tags)

[<p>
<i>OBAMA</i>Obama's first quotes
        </p>, <p>More quotes from Obama</p>, <p>Some more Obama quotes</p>]
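
The same state-machine idea can be extended to collect every speaker's paragraphs in one pass. The following is only a sketch under a few assumptions: the function name quotes_by_speaker is made up for illustration, <i> is assumed to appear only as a speaker label, and speaker names are kept exactly as they appear in the markup (so "Moderator" and "MODERATOR" would be treated as different speakers).

```python
from collections import defaultdict

from bs4 import BeautifulSoup


def quotes_by_speaker(html):
    """Group paragraph text under the most recently seen speaker label."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("span", class_="displaytext")
    grouped = defaultdict(list)
    speaker = None  # state: who is currently talking
    for p in container.find_all("p"):
        label = p.find("i")
        if label is not None:
            # assumption: <i> is used only to name the speaker
            speaker = label.get_text(strip=True)
            label.extract()  # drop the label so only the quote text remains
        text = p.get_text(" ", strip=True)
        if speaker is not None and text:
            grouped[speaker].append(text)
    return dict(grouped)
```

Calling quotes_by_speaker(input)["OBAMA"] on the sample markup should then give the three Obama paragraphs as plain strings. Note that label.extract() modifies the parsed tree, which is harmless here because the soup is local to the function.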

The other Obama quotes are siblings of the <p>, not the <i>, so you need to look at the siblings of the <i>'s parent. As you loop through those siblings, stop when you reach one that contains an <i>. Something like this:

for i in president_quotes:
    print(i.next_sibling)
    siblings = i.parent.find_next_siblings('p')
    for sibling in siblings:
        if sibling.find("i"):
            break
        print(sibling.string)

which prints:

Obama's first quotes

More quotes from Obama
Some more Obama quotes
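
If the speaker talks more than once, the same sibling walk can be wrapped in a small helper that returns one list of <p> tags per turn. This is a rough sketch rather than a definitive version: speaker_turns is a hypothetical name, each matching <i> is assumed to sit directly inside its <p>, and itertools.takewhile stands in for the explicit break.

```python
from itertools import takewhile

from bs4 import BeautifulSoup


def speaker_turns(container, speaker):
    """Return a list of turns; each turn is a list of <p> tags."""
    turns = []
    for label in container.find_all("i", string=speaker):
        first = label.parent  # assumption: the <i> is a direct child of its <p>
        # keep following <p> siblings until one carries its own <i> label
        following = takewhile(
            lambda p: p.find("i") is None,
            first.find_next_siblings("p"),
        )
        turns.append([first, *following])
    return turns
```

For example, speaker_turns(debate_text, "OBAMA") should return a single turn holding the three Obama paragraphs for the markup in the question.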
