Getting certain tags within a parent tag using BeautifulSoup4

Question

I am using beautifulsoup4 with Python to scrape content from the web. I am trying to extract content from specific HTML tags while ignoring others.

I have the following HTML:

<div class="the-one-i-want">
    <p>
        "random text content here and about"
    </p>
    <p>
        "random text content here and about"
    </p>
    <p>
        "random text content here and about"
    </p>
    <div class="random-inserted-element-i-dont-want">
        <content>
    </div>
    <p>
        "random text content here and about"
    </p>
    <p>
        "random text content here and about"
    </p>
</div>

My goal is to understand how to instruct Python to get only the <p> elements from within the parent <div class="the-one-i-want">, ignoring all the <div> elements inside it.

Currently, I am locating the content of the parent div with the following call:

content = soup.find('div', class_='the-one-i-want')

However, I can't figure out how to narrow that result down to only the <p> tags without getting an error.

Answer 1

h = """<div class="the-one-i-want">
    <p>
        "random text content here and about"
    </p>
    <p>
        "random text content here and about"
    </p>
    <p>
        "random text content here and about"
    </p>
    <div class="random-inserted-element-i-dont-want">
        <content>
    </div>
    <p>
        "random text content here and about"
    </p>
    <p>
        "random text content here and about"
    </p>
</div>"""

You can call find_all("p") on the result of find:

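from bs4 import BeautifulSoup

# Name the parser explicitly so the result does not depend on which parsers are installed.
soup = BeautifulSoup(h, "html.parser")

# A string passed as the second argument to find() is matched against the CSS class.
print(soup.find("div", "the-one-i-want").find_all("p"))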

Or use a CSS selector:

print(soup.select("div.the-one-i-want p"))

Both will give you:

[<p>\n        "random text content here and about"\n    </p>, <p>\n        "random text content here and about"\n    </p>, <p>\n        "random text content here and about"\n    </p>, <p>\n        "random text content here and about"\n    </p>, <p>\n        "random text content here and about"\n    </p>]
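If the goal is the text rather than the tag objects, something like the following may help. It is only a sketch: the shortened HTML and the names `html`, `parent`, `paragraphs`, and `texts` are made up for illustration, and the `None` check is an assumption about where the error mentioned in the question might come from (find returns None when nothing matches).

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical version of the HTML from the question.
html = """<div class="the-one-i-want">
    <p>first paragraph</p>
    <div class="random-inserted-element-i-dont-want">unwanted</div>
    <p>second paragraph</p>
</div>"""

soup = BeautifulSoup(html, "html.parser")
parent = soup.find("div", class_="the-one-i-want")

# find() returns None when no element matches, so guard before chaining.
paragraphs = parent.find_all("p") if parent is not None else []

# get_text(strip=True) removes the whitespace surrounding each paragraph's text.
texts = [p.get_text(strip=True) for p in paragraphs]
print(texts)  # ['first paragraph', 'second paragraph']
```

The text of the unwanted inner div is skipped because only the <p> tags are collected.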

find_all only searches the descendants of the div with the class the-one-i-want; the same applies to the select call.
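Because both calls search all descendants, a <p> nested inside the unwanted inner div would also be returned. That case does not occur in the question's HTML, so the markup below is a hypothetical variation. As a sketch, passing recursive=False to find_all, or using the child combinator > in the selector, should restrict the match to direct children of the parent div:

```python
from bs4 import BeautifulSoup

# Hypothetical variation: the unwanted div now contains a <p> of its own.
html = """<div class="the-one-i-want">
    <p>direct child</p>
    <div class="random-inserted-element-i-dont-want">
        <p>nested inside the unwanted div</p>
    </div>
    <p>another direct child</p>
</div>"""

soup = BeautifulSoup(html, "html.parser")
parent = soup.find("div", class_="the-one-i-want")

# Default behaviour: every descendant <p>, including the nested one.
all_p = parent.find_all("p")

# recursive=False limits the search to direct children of the parent div.
direct_p = parent.find_all("p", recursive=False)

# The CSS child combinator ">" expresses the same restriction.
direct_p_css = soup.select("div.the-one-i-want > p")

print(len(all_p))  # 3
print([p.get_text(strip=True) for p in direct_p])  # ['direct child', 'another direct child']
```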

Answer 1: 2, accepted, 2016-06-24 20:49:34
