Remove tag from text with BeautifulSoup
There are a lot of questions here with similar titles, but I'm trying to remove the tag from the soup object itself.
I have a page that contains, among other things, this div:
<div id="content">
I want to keep this<br /><div id="blah">I want to remove this</div>
</div>
I can select <div id="content"> with soup.find('div', id='content'), but I want to remove the <div id="blah"> from it.
You can use extract if you want to remove a tag or string from the tree. It detaches the element and returns it:
In [13]: from bs4 import BeautifulSoup
In [14]: soup = BeautifulSoup("""<div id="content">
....: I want to keep this<br /><div id="blah">I want to remove this</div>
....: </div>""")
In [15]: blah = soup.find(id='blah')
In [16]: _ = blah.extract()
In [17]: soup
Out[17]:
<html><body><div id="content">
I want to keep this<br/>
</div></body></html>
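Since extract returns the element it removes, the detached tag can still be inspected afterwards. A minimal sketch of that, assuming the built-in "html.parser" (the session above does not name a parser, and its <html><body> wrapper suggests a different one was in use); the variable names are illustrative only:

```python
from bs4 import BeautifulSoup

html = """<div id="content">
I want to keep this<br /><div id="blah">I want to remove this</div>
</div>"""

# Assumption: html.parser, so no <html><body> wrapper is added
soup = BeautifulSoup(html, "html.parser")
content = soup.find("div", id="content")

# extract() detaches the tag from the tree and hands it back
removed = content.find("div", id="blah").extract()

# Text left in the container, with surrounding whitespace stripped
remaining_text = content.get_text(strip=True)
# Text of the detached tag, which is still usable on its own
removed_text = removed.get_text()

print(remaining_text)
print(removed_text)
```

If the removed tag is not needed afterwards, decompose may be the better fit, since it discards the tag instead of returning it.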
The Tag.decompose method removes a tag from the tree and destroys it together with its contents. So find the div tag:
div = soup.find('div', {'id':'content'})
Loop over all the children but the first:
for child in list(div)[1:]:
and try to decompose each one (children that are plain strings rather than tags may not have a decompose method, hence the AttributeError guard):
    try:
        child.decompose()
    except AttributeError: pass
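import bs4 as bs

content = '''<div id="content">
I want to keep this<br /><div id="blah">I want to remove this</div>
</div>'''
# Name the parser explicitly to avoid bs4's "no parser specified" warning
soup = bs.BeautifulSoup(content, 'html.parser')
div = soup.find('div', {'id':'content'})
# Copy the children into a list first, since decompose() mutates the tree
for child in list(div)[1:]:
    try:
        child.decompose()
    except AttributeError:
        # Plain text nodes may not support decompose(); leave them in place
        pass
print(div)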
yields
<div id="content">
I want to keep this
</div>
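Note that the loop above also removes the <br/>, because it drops every child after the first. If only the inner div should go, one option is to look it up by id and decompose just that tag. A minimal sketch, assuming "html.parser"; remove_child_by_id is a hypothetical helper name, not part of BeautifulSoup:

```python
from bs4 import BeautifulSoup


def remove_child_by_id(container, element_id):
    """Hypothetical helper: decompose the descendant with the given id.

    Returns True if a matching tag was found and removed, else False.
    """
    target = container.find(id=element_id)
    if target is None:
        return False
    target.decompose()
    return True


html = """<div id="content">
I want to keep this<br /><div id="blah">I want to remove this</div>
</div>"""

soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", {"id": "content"})

found = remove_child_by_id(div, "blah")
result = str(div)
print(result)
```

Looking the tag up by id avoids relying on the position of the children, so the code keeps working if the surrounding markup changes.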
The equivalent using lxml would be
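import lxml.html as LH

content = '''<div id="content">
I want to keep this<br /><div id="blah">I want to remove this</div>
</div>'''
root = LH.fromstring(content)
div = root.xpath('//div[@id="content"]')[0]
# Iterate over a copy of the children so removal does not disturb the loop
for child in list(div):
    div.remove(child)
# encoding='unicode' returns str instead of bytes on Python 3
print(LH.tostring(div, encoding='unicode'))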
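One thing to watch for with lxml: text that follows an element is stored as that element's tail, and remove takes the tail away along with the element. When that trailing text should survive, lxml.html elements offer drop_tree, which removes the element and its children but joins the tail to the preceding sibling or the parent. A minimal sketch, assuming a variant of the markup with extra trailing text (" and this tail") added purely for illustration:

```python
import lxml.html as LH

# Assumption: trailing text after the inner div, added to show tail handling
content = '''<div id="content">
I want to keep this<br /><div id="blah">I want to remove this</div> and this tail
</div>'''

root = LH.fromstring(content)
blah = root.xpath('//div[@id="blah"]')[0]

# drop_tree() removes the element and its children but keeps its tail text
blah.drop_tree()

result = LH.tostring(root, encoding='unicode')
print(result)
```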