How to get all the tags (with content) under a certain class with BeautifulSoup?
I have an element in my soup whose class marks it as the description of a unit:
<div class="ats-description">
<p>Here is a paragraph</p>
<div>inner div</div>
<div>Another div</div>
<ul>
<li>Item1</li>
<li>Item2</li>
<li>Item3</li>
</ul>
</div>
I can easily grab this part with soup.select(".ats-description")[0]. Now I want to remove the outer <div class="ats-description"> and keep only the inner tags (to retain the text structure). How do I do it?
soup.select(".ats-description")[0].getText() gives me all the text within, like this:
'\nHere is a paragraph\ninner div\nAnother div\n\nItem1\nItem2\nItem3\n\n\n'
But this removes all the inner tags, so it's just unstructured text. I want to keep the tags as well.
To get the inner HTML, use the .decode_contents() method:
innerHTML = soup.select_one('.ats-description').decode_contents()
print(innerHTML)
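For completeness, here is a self-contained sketch of the same idea that can be run as-is. It assumes the HTML from the question and uses the built-in "html.parser" backend (my choice, so that lxml is not required); the variable names wrapper and inner_html are only illustrative.

```python
from bs4 import BeautifulSoup

html = """<div class="ats-description">
<p>Here is a paragraph</p>
<div>inner div</div>
<div>Another div</div>
<ul>
<li>Item1</li>
<li>Item2</li>
<li>Item3</li>
</ul>
</div>"""

# "html.parser" ships with Python; any other installed parser should work too.
soup = BeautifulSoup(html, "html.parser")

# select_one returns the first match, or None if nothing matches.
wrapper = soup.select_one(".ats-description")

# decode_contents() renders the children of the tag as a string,
# without the enclosing <div class="ats-description"> itself.
inner_html = wrapper.decode_contents()
print(inner_html)
```

The result is a plain string, so if you need to keep working with the inner tags as objects, iterating over the element's children is probably the better fit.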
Try this: pass a list of tag names to find_all() to match the inner tags by name.
from bs4 import BeautifulSoup
html="""<div class="ats-description">
<p>Here is a paragraph</p>
<div>inner div</div>
<div>Another div</div>
<ul>
<li>Item1</li>
<li>Item2</li>
<li>Item3</li>
</ul>
</div>"""
soup = BeautifulSoup(html, 'lxml')
print(soup.select_one("div.ats-description").find_all(['p','div','ul']))
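One thing to watch with the list-of-names approach: find_all() searches all descendants by default, so a matching tag nested inside another matching tag is returned a second time. A possible variation, sketched below on a small made-up snippet (not the HTML from the question), is to pass recursive=False so that only the direct children of the wrapper are returned; passing True as the name matches any tag.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet with a <p> nested inside an inner <div>.
html = '<div class="ats-description"><p>Intro</p><div><p>Nested</p></div></div>'

soup = BeautifulSoup(html, "html.parser")
wrapper = soup.select_one("div.ats-description")

# Default (recursive) search: the nested <p> is matched in addition to its parent <div>.
all_matches = wrapper.find_all(["p", "div"])

# Direct children only: True matches any tag name, recursive=False stops at one level.
direct_children = wrapper.find_all(True, recursive=False)

# Joining the direct children gives the inner markup without duplicates.
rebuilt = "".join(str(child) for child in direct_children)

print([tag.name for tag in all_matches])
print([tag.name for tag in direct_children])
print(rebuilt)
```

Which variant is preferable depends on whether you want a flat list of every matching tag or just the top-level structure under the wrapper.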