
Beautiful Soup 4: Remove comment tag and its content

The page I'm scraping contains the HTML below. How do I remove the comment <!-- --> along with its content using bs4?

<div class="foo">
cat dog sheep goat
<!-- 
<p>NewPP limit report
Preprocessor node count: 478/300000
Post‐expand include size: 4852/2097152 bytes
Template argument size: 870/2097152 bytes
Expensive parser function count: 2/100
ExtLoops count: 6/100
</p>
-->
</div>

You can use extract() (the solution is based on this answer):

PageElement.extract() removes a tag or string from the tree. It returns the tag or string that was extracted.

from bs4 import BeautifulSoup, Comment

data = """<div class="foo">
cat dog sheep goat
<!--
<p>test</p>
-->
</div>"""

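soup = BeautifulSoup(data, "html.parser")  # name the parser explicitly

div = soup.find('div', class_='foo')
# Calling a tag is shorthand for find_all(); string= was named text= before bs4 4.4.0
for element in div(string=lambda text: isinstance(text, Comment)):
    element.extract()

print(soup.prettify())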

As a result, you get your div without the comment:

<div class="foo">
    cat dog sheep goat
</div>
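
The same idea extends to a whole document. Below is a minimal sketch, assuming the built-in html.parser and bs4 4.4.0 or later (for the string= argument). The helper name strip_comments is made up for illustration and is not part of bs4. It also shows that extract() hands back each removed comment:

```python
from bs4 import BeautifulSoup, Comment


def strip_comments(html):
    """Hypothetical helper: return (cleaned_html, removed_comment_texts)."""
    soup = BeautifulSoup(html, "html.parser")
    # find_all() returns a list, so extracting inside the loop is safe.
    comments = soup.find_all(string=lambda s: isinstance(s, Comment))
    # extract() detaches each comment from the tree and returns it.
    removed = [str(c.extract()) for c in comments]
    return str(soup), removed


cleaned, removed = strip_comments('<div class="foo">cat<!-- secret --> dog</div>')
print(cleaned)
print(removed)
```

Keeping the extracted comments around can be handy if you want to log what was stripped; if not, simply ignore the second return value.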

Usually, modifying the bs4 parse tree is unnecessary. If all you want is the div's text, you can just read it:

soup.body.div.text
Out[18]: '\ncat dog sheep goat\n\n'
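
To illustrate that text extraction already skips comments, here is a minimal sketch, assuming the built-in html.parser and a made-up sample string (not the page from the question):

```python
from bs4 import BeautifulSoup

# Assumed sample input with one inline comment.
html = '<div class="foo">cat dog<!-- hidden --> sheep</div>'
soup = BeautifulSoup(html, "html.parser")

# get_text() joins only the regular text nodes; the Comment node is left out.
text = soup.find("div", class_="foo").get_text()
print(text)
```

The comment is still in the parse tree after this; it is only excluded from the returned text.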

bs4 keeps the comment separate from the text. However, if you really need to modify the parse tree:

from bs4 import Comment

# Iterate over a copy: extracting while walking .children can skip nodes
for child in list(soup.body.div.children):
    if isinstance(child, Comment):
        child.extract()

From this answer: if you are looking for a solution in BeautifulSoup version 3, see BS3 Docs - Comment:

# BeautifulSoup 3 (Python 2 only); Comment can be imported directly from the package
from BeautifulSoup import BeautifulSoup, Comment

soup = BeautifulSoup("""Hello! <!--I've got to be nice to get what I want.-->""")
for element in soup(text=lambda text: isinstance(text, Comment)):
    element.extract()
print(soup.prettify())

A little late, but I have compared the main answers on the internet so you can choose what's best for you:

We can also remove comments with a regex:

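import re
soupstr = str(soup)
# DOTALL lets "." cross newlines, so multi-line comments match too
result = re.sub(r'<!--.*?-->', '', soupstr, flags=re.DOTALL)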

However, when the soup first has to be converted to a string via soupstr=str(soup), this regex method is roughly 4 times slower than the findAll...isinstance(x,Comment) approach written by others.

It is roughly 5 times faster, though, when you already have the HTML as a string and apply the regex to remove comments before parsing.

Benchmark results after running each function 1000 times:

bs4, isinstance(x, Comment) method:      0.01193189620971680ms
convert soup to string and apply regex:  0.04188799858093262ms
apply regex before converting to soup:   0.00195980072021484ms (WINNER!)
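
If you want to reproduce a comparison like this yourself, here is a rough timeit sketch. The sample HTML, the function names, and the run count are all assumptions, and each function here includes parsing time, which may differ from how the numbers above were measured, so expect different absolute values:

```python
import re
import timeit

from bs4 import BeautifulSoup, Comment

# Assumed sample document; real timings depend on page size and parser.
HTML = '<div class="foo">cat dog\n<!--\n<p>test</p>\n-->\n</div>' * 50
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)


def with_isinstance(html):
    # Parse, then extract every Comment node from the tree.
    soup = BeautifulSoup(html, "html.parser")
    for c in soup.find_all(string=lambda s: isinstance(s, Comment)):
        c.extract()
    return soup


def regex_after_parse(html):
    # Parse, serialize back to a string, then strip comments with a regex.
    soup = BeautifulSoup(html, "html.parser")
    return COMMENT_RE.sub("", str(soup))


def regex_before_parse(html):
    # Strip comments from the raw string first, then parse the result.
    return BeautifulSoup(COMMENT_RE.sub("", html), "html.parser")


for fn in (with_isinstance, regex_after_parse, regex_before_parse):
    seconds = timeit.timeit(lambda: fn(HTML), number=20)
    print(f"{fn.__name__}: {seconds:.4f} s for 20 runs")
```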

You may prefer a pure regex in cases where you don't want to use the isinstance method.

For people who need a quick result and don't want to read the full answer, here is a copy-paste function ready to run:

import re

def remove_comments_regexmethod(soup):
    # soup may be a string or a bs4.BeautifulSoup instance; anything that is not
    # a string is converted with str(). Pass a string for the highest speed.
    if not isinstance(soup, str):
        soup = str(soup)
    # DOTALL lets "." match newlines, so multi-line comments are removed too
    return re.sub(r'<!--.*?-->', '', soup, flags=re.DOTALL)  # returns a string
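
One caveat with regex-based removal: by default "." does not match newlines, so a multi-line comment like the one in the question needs re.DOTALL. A small sketch using the shortened sample HTML from the first answer:

```python
import re

html = '<div class="foo">\ncat dog sheep goat\n<!--\n<p>test</p>\n-->\n</div>'

# Without DOTALL, "." stops at newlines, so the multi-line comment survives.
single_line = re.sub(r"<!--.*?-->", "", html)
# With DOTALL, "." also matches newlines and the whole comment is removed.
multi_line = re.sub(r"<!--.*?-->", "", html, flags=re.DOTALL)

print("<!--" in single_line)  # True
print("<!--" in multi_line)   # False
```

A regex also has no notion of HTML structure, so it will strip comment-like text wherever it appears, for example inside a script block; the isinstance approach is the safer choice when that matters.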
