Get Certain Tags Within Parent Tag Using Beautifulsoup4
I am using beautifulsoup4 with Python to scrape content from the web, and I am trying to extract content from specific HTML tags while ignoring others.
I have the following HTML:
<div class="the-one-i-want">
<p>
"random text content here and about"
</p>
<p>
"random text content here and about"
</p>
<p>
"random text content here and about"
</p>
<div class="random-inserted-element-i-dont-want">
<content>
</div>
<p>
"random text content here and about"
</p>
<p>
"random text content here and about"
</p>
</div>
My goal is to understand how to instruct Python to get only the <p> elements from within the parent <div class="the-one-i-want">, ignoring all the <div>s inside it.
Currently, I locate the parent div like this:
content = soup.find('div', class_='the-one-i-want')
However, I can't figure out how to narrow this down further so that only the <p> tags are extracted, without getting an error.
h = """<div class="the-one-i-want">
<p>
"random text content here and about"
</p>
<p>
"random text content here and about"
</p>
<p>
"random text content here and about"
</p>
<div class="random-inserted-element-i-dont-want">
<content>
</div>
<p>
"random text content here and about"
</p>
<p>
"random text content here and about"
</p>
</div>"""
You can just call find_all("p") on the result of find:
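from bs4 import BeautifulSoup
# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(h, "html.parser")
# A string passed as the second argument to find() is matched against the CSS class
print(soup.find("div", "the-one-i-want").find_all("p"))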
Or use a CSS selector:
print(soup.select("div.the-one-i-want p"))
Both will give you:
[<p>\n "random text content here and about"\n </p>, <p>\n "random text content here and about"\n </p>, <p>\n "random text content here and about"\n </p>, <p>\n "random text content here and about"\n </p>, <p>\n "random text content here and about"\n </p>]
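If the text inside the paragraphs is what is needed rather than the tag objects, something along these lines may help. It is only a sketch: the shortened HTML and the names html, parent and paragraph_texts are made up for illustration and are not from the question. It also checks for None, since find returns None when nothing matches, and calling find_all on None is one plausible cause of the error mentioned in the question.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical version of the question's markup
html = """<div class="the-one-i-want">
<p>first paragraph</p>
<div class="random-inserted-element-i-dont-want"><span>skip me</span></div>
<p>second paragraph</p>
</div>"""

soup = BeautifulSoup(html, "html.parser")
parent = soup.find("div", class_="the-one-i-want")

if parent is None:
    # find() returns None when no element matches
    paragraph_texts = []
else:
    # get_text(strip=True) returns the text with surrounding whitespace removed
    paragraph_texts = [p.get_text(strip=True) for p in parent.find_all("p")]

print(paragraph_texts)
```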
find_all will only find descendants of the div with the class the-one-i-want; the same applies to our select.
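Note that "descendants" includes any <p> nested inside an inner div. In the question's HTML the unwanted div holds no <p>, so this makes no difference there. If it did, and only the direct children were wanted, a variation like the following might work. It is a sketch on made-up markup (the nested <p> is an assumption added for illustration), using the recursive=False argument of find_all and the CSS child combinator >.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: here the unwanted div contains a <p> of its own
html = """<div class="the-one-i-want">
<p>direct child</p>
<div class="random-inserted-element-i-dont-want"><p>nested paragraph</p></div>
<p>another direct child</p>
</div>"""

soup = BeautifulSoup(html, "html.parser")
parent = soup.find("div", class_="the-one-i-want")

# Default behaviour: every <p> at any depth below the parent
all_descendants = parent.find_all("p")

# recursive=False limits the search to the parent's direct children
direct_only = parent.find_all("p", recursive=False)

# The same restriction written as a CSS selector with the child combinator
direct_via_css = soup.select("div.the-one-i-want > p")

print(len(all_descendants), len(direct_only), len(direct_via_css))
print([p.get_text() for p in direct_only])
```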