
Get Certain Tags Within Parent Tag Using Beautifulsoup4

I am using beautifulsoup4 with Python to scrape content from the web, and I am trying to extract content from specific HTML tags while ignoring others.

I have the following HTML:

<div class="the-one-i-want">
    <p>
        "random text content here and about"
    </p>
    <p>
        "random text content here and about"
    </p>
    <p>
        "random text content here and about"
    </p>
    <div class="random-inserted-element-i-dont-want">
        <content>
    </div>
    <p>
        "random text content here and about"
    </p>
    <p>
        "random text content here and about"
    </p>
</div>

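My goal is to understand how to tell Python to get only the <p> elements from within the parent <div class="the-one-i-want">, ignoring any <div> elements inside it.

Currently, I find the contents of the parent div with the following: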

content = soup.find('div', class_='the-one-i-want')

However, I can't seem to figure out how to further specify that only the <p> tags should be extracted from it without getting errors.

h = """<div class="the-one-i-want">
    <p>
        "random text content here and about"
    </p>
    <p>
        "random text content here and about"
    </p>
    <p>
        "random text content here and about"
    </p>
    <div class="random-inserted-element-i-dont-want">
        <content>
    </div>
    <p>
        "random text content here and about"
    </p>
    <p>
        "random text content here and about"
    </p>
</div>"""

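You can call find_all("p") on the result of find:

from bs4 import BeautifulSoup
# Name the parser explicitly; without it bs4 picks one itself and emits a warning
soup = BeautifulSoup(h, "html.parser")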

print(soup.find("div","the-one-i-want").find_all("p"))

Or use a CSS selector:

print(soup.select("div.the-one-i-want p"))

Both will give you:

[<p>\n        "random text content here and about"\n    </p>, <p>\n        "random text content here and about"\n    </p>, <p>\n        "random text content here and about"\n    </p>, <p>\n        "random text content here and about"\n    </p>, <p>\n        "random text content here and about"\n    </p>]
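If only the text of each paragraph is needed rather than the tags themselves, get_text can be applied to every match. The following is a minimal sketch, not part of the original answer: the shortened markup and the names sample, via_find, via_select and paragraph_texts are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened version of the markup from the question
sample = """<div class="the-one-i-want">
    <p>first paragraph</p>
    <div class="random-inserted-element-i-dont-want">unwanted</div>
    <p>second paragraph</p>
</div>"""

soup = BeautifulSoup(sample, "html.parser")

# find returns None when nothing matches, so guard before chaining find_all
parent = soup.find("div", class_="the-one-i-want")
via_find = parent.find_all("p") if parent is not None else []

# Equivalent lookup with a CSS selector
via_select = soup.select("div.the-one-i-want p")

# Strip surrounding whitespace from the text of each <p>
paragraph_texts = [p.get_text(strip=True) for p in via_find]
print(paragraph_texts)  # ['first paragraph', 'second paragraph']
```

The None check is worth keeping when scraping real pages, since chaining find_all directly onto a failed find raises an AttributeError.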

find_all only searches the descendants of the div with the class the-one-i-want, and the same applies to our select call.
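Because both lookups match descendants at any depth, a <p> nested inside the unwanted inner div would also be returned. The markup in the question has no such nested <p>, so the sketch below uses a made-up variant (the names sample, all_descendants, direct_only and direct_css are illustrative) to show how recursive=False or the CSS child combinator > limits the match to direct children.

```python
from bs4 import BeautifulSoup

# Hypothetical variant of the question's markup with a <p> nested in the unwanted div
sample = """<div class="the-one-i-want">
    <p>direct child</p>
    <div class="random-inserted-element-i-dont-want">
        <p>nested inside the unwanted div</p>
    </div>
</div>"""

soup = BeautifulSoup(sample, "html.parser")
parent = soup.find("div", class_="the-one-i-want")

# Default behaviour: every <p> descendant, however deeply nested
all_descendants = parent.find_all("p")

# recursive=False restricts the search to direct children of the parent
direct_only = parent.find_all("p", recursive=False)

# The CSS child combinator ">" expresses the same restriction
direct_css = soup.select("div.the-one-i-want > p")

print(len(all_descendants))  # 2
print(len(direct_only))      # 1
print(len(direct_css))       # 1
```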

