Remove a tag from text using BeautifulSoup

Question

There are a lot of questions here with similar titles, but I'm trying to remove a tag from the soup object itself.

I have a page that contains, among other things, this div:

<div id="content">
I want to keep this<br /><div id="blah">I want to remove this</div>
</div>

I can select <div id="content"> with soup.find('div', id='content'), but I want to remove the <div id="blah"> from it.

Answer 1

You can use extract() if you want to remove a tag or string from the tree.

In [13]: from bs4 import BeautifulSoup

In [14]: soup = BeautifulSoup("""<div id="content">
   ....: I want to keep this<br /><div id="blah">I want to remove this</div>
   ....: </div>""")

In [15]: blah = soup.find(id='blah')

In [16]: _ = blah.extract()

In [17]: soup
Out[17]: 
<html><body><div id="content">
I want to keep this<br/>
</div></body></html>
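
For reference, here is the same idea as a standalone script. It is only a minimal sketch: it assumes the bs4 package is installed, and it passes the built-in "html.parser" explicitly, which is a choice made here rather than something the answer specifies (with that parser the fragment is not wrapped in <html><body>). It also shows that extract() returns the removed tag, so the tag can still be inspected afterwards.

```python
from bs4 import BeautifulSoup

html = """<div id="content">
I want to keep this<br /><div id="blah">I want to remove this</div>
</div>"""

# "html.parser" is an assumption for this sketch; any installed parser should work
soup = BeautifulSoup(html, "html.parser")
content = soup.find("div", id="content")

# extract() detaches the tag from the tree and returns it
removed = content.find("div", id="blah").extract()

print(content)             # the outer div, now without the inner one
print(removed.get_text())  # the text of the detached tag
```

If the removed tag is not needed afterwards, the return value can simply be ignored, as in the session above.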

Answer 2

The Tag.decompose method removes a tag from the tree and destroys it along with its contents. So first find the div tag:

div = soup.find('div', {'id':'content'})

Loop over all the children except the first (the first child is the text to keep):

for child in list(div)[1:]:

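and try to decompose each child. A child that is a plain string rather than a tag may not have a decompose method, so the AttributeError is caught and ignored: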

    try:
        child.decompose()
    except AttributeError: pass

import bs4 as bs

content = '''<div id="content">
I want to keep this<br /><div id="blah">I want to remove this</div>
</div>'''
soup = bs.BeautifulSoup(content, 'html.parser')  # name the parser explicitly to avoid the "no parser specified" warning
div = soup.find('div', {'id':'content'})
for child in list(div)[1:]:
    try:
        child.decompose()
    except AttributeError: pass
print(div)

yields

<div id="content">
I want to keep this
</div>

The equivalent using lxml would be

import lxml.html as LH

content = '''<div id="content">
I want to keep this<br /><div id="blah">I want to remove this</div>
</div>'''
root = LH.fromstring(content)

div = root.xpath('//div[@id="content"]')[0]
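for child in list(div):  # iterate over a copy, since children are removed inside the loop
    div.remove(child)
print(LH.tostring(div, encoding='unicode'))  # return str rather than bytes on Python 3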
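
As a variation on the BeautifulSoup version above: if the tag to drop can be selected directly, looping over the children is not needed. The following is a sketch under the same assumptions (bs4 installed, "html.parser" chosen here for illustration). It calls decompose() on the inner div only, so unlike the loop it leaves the <br/> in place.

```python
from bs4 import BeautifulSoup

content = '''<div id="content">
I want to keep this<br /><div id="blah">I want to remove this</div>
</div>'''

# "html.parser" is an assumption for this sketch, not part of the original answer
soup = BeautifulSoup(content, "html.parser")
div = soup.find("div", {"id": "content"})

# Select the unwanted tag directly instead of slicing the list of children
blah = div.find("div", {"id": "blah"})
if blah is not None:  # find() returns None when nothing matches
    blah.decompose()  # remove the tag from the tree and destroy it

print(div)
```

Selecting by id avoids depending on the position of the children, which matters if the page layout changes.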
