
Get Certain Tags Within Parent Tag Using Beautifulsoup4

I am using beautifulsoup4 with Python to scrape content from the web. I am trying to extract content from specific HTML tags while ignoring others.

I have the following html:

<div class="the-one-i-want">
    <p>
        "random text content here and about"
    </p>
    <p>
        "random text content here and about"
    </p>
    <p>
        "random text content here and about"
    </p>
    <div class="random-inserted-element-i-dont-want">
        <content>
    </div>
    <p>
        "random text content here and about"
    </p>
    <p>
        "random text content here and about"
    </p>
</div>

My goal is to understand how to instruct Python to get only the <p> elements from within the parent <div class="the-one-i-want">, ignoring any <div> elements inside it.

Currently, I am locating the content of the parent div by the following method:

content = soup.find('div', class_='the-one-i-want')

However, I can't figure out how to narrow that result down to only the <p> tags without getting an error.

h = """<div class="the-one-i-want">
    <p>
        "random text content here and about"
    </p>
    <p>
        "random text content here and about"
    </p>
    <p>
        "random text content here and about"
    </p>
    <div class="random-inserted-element-i-dont-want">
        <content>
    </div>
    <p>
        "random text content here and about"
    </p>
    <p>
        "random text content here and about"
    </p>
</div>"""

Once you have found the parent div, you can call find_all("p") on it:

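from bs4 import BeautifulSoup

# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(h, "html.parser")

# Find the parent div by class, then collect every <p> inside it
print(soup.find("div", class_="the-one-i-want").find_all("p"))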

Or use a CSS selector:

print(soup.select("div.the-one-i-want p"))

Both will give you:

[<p>\n        "random text content here and about"\n    </p>, <p>\n        "random text content here and about"\n    </p>, <p>\n        "random text content here and about"\n    </p>, <p>\n        "random text content here and about"\n    </p>, <p>\n        "random text content here and about"\n    </p>]
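If you want the text rather than the tag objects, a minimal sketch along the following lines may help. The sample markup and the names html, parent and paragraphs are illustrative assumptions, not part of the original question; get_text(strip=True) is the standard Beautiful Soup call for reading a tag's text with surrounding whitespace removed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, shortened from the question's structure
html = """<div class="the-one-i-want">
    <p>first paragraph</p>
    <div class="random-inserted-element-i-dont-want">
        <span>unwanted</span>
    </div>
    <p>second paragraph</p>
</div>"""

soup = BeautifulSoup(html, "html.parser")
parent = soup.find("div", class_="the-one-i-want")

# Collect the stripped text of every <p> found under the parent div
paragraphs = [p.get_text(strip=True) for p in parent.find_all("p")]
print(paragraphs)  # ['first paragraph', 'second paragraph']
```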

find_all only searches the descendants of the div with the class the-one-i-want, and the same applies to the select call, since its selector is scoped to that div.
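Note that "descendants" includes any <p> nested inside the unwanted inner div. If that matters for your page, one possible approach is to restrict the search to direct children, either with recursive=False or with the CSS child combinator (>). The sketch below uses made-up markup in which the inner div contains its own <p>; the variable names are assumptions for illustration only.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the unwanted div holds a nested <p>
html = """<div class="the-one-i-want">
    <p>direct child</p>
    <div class="random-inserted-element-i-dont-want">
        <p>nested inside the unwanted div</p>
    </div>
    <p>another direct child</p>
</div>"""

soup = BeautifulSoup(html, "html.parser")
parent = soup.find("div", class_="the-one-i-want")

# Default search: every <p> descendant, including the nested one
all_p = parent.find_all("p")

# recursive=False limits the search to direct children of the parent
direct_p = parent.find_all("p", recursive=False)

# The same restriction written as a CSS child combinator
direct_css = soup.select("div.the-one-i-want > p")

print(len(all_p))                                   # 3
print([p.get_text(strip=True) for p in direct_p])   # ['direct child', 'another direct child']
```

In the markup from the question the inner div has no <p> of its own, so both approaches return the same five tags there.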

