
BeautifulSoup: Strip specified attributes, but preserve the tag and its contents

I'm trying to 'defrontpagify' the HTML of a website generated by MS FrontPage, and I'm writing a BeautifulSoup script to do it.

However, I've gotten stuck on the part where I try to strip a particular attribute (or a list of attributes) from every tag in the document that contains them. The code snippet:

REMOVE_ATTRIBUTES = ['lang','language','onmouseover','onmouseout','script','style','font',
                        'dir','face','size','color','style','class','width','height','hspace',
                        'border','valign','align','background','bgcolor','text','link','vlink',
                        'alink','cellpadding','cellspacing']

# remove all attributes in REMOVE_ATTRIBUTES from all tags, 
# but preserve the tag and its content. 
for attribute in REMOVE_ATTRIBUTES:
    for tag in soup.findAll(attribute=True):
        del(tag[attribute])

It runs without error, but doesn't actually strip any of the attributes. When I run it without the outer loop, hard-coding a single attribute (soup.findAll('style'=True)), it works.

Does anyone see the problem here?

PS - I don't much like the nested loops either. If anyone knows a more functional, map/filter-ish style, I'd love to see it.

The line

for tag in soup.findAll(attribute=True):

does not find any tags. There might be a way to use findAll; I'm not sure. However, this works:

from bs4 import BeautifulSoup  # bs4 (BeautifulSoup 4) is the release that runs on Python 3

REMOVE_ATTRIBUTES = [
    'lang','language','onmouseover','onmouseout','script','style','font',
    'dir','face','size','color','style','class','width','height','hspace',
    'border','valign','align','background','bgcolor','text','link','vlink',
    'alink','cellpadding','cellspacing']

doc = '''<html><head><title>Page title</title></head><body><p id="firstpara" align="center">This is <i>paragraph</i> <a onmouseout="">one</a>.<p id="secondpara" align="blah">This is <i>paragraph</i> <b>two</b>.</html>'''
soup = BeautifulSoup(doc, 'html.parser')
for tag in soup.recursiveChildGenerator():
    try:
        # In bs4, tag.attrs is a dict, so rebuild it without the unwanted keys.
        tag.attrs = {key: value for key, value in tag.attrs.items()
                     if key not in REMOVE_ATTRIBUTES}
    except AttributeError:
        # 'NavigableString' object has no attribute 'attrs'
        pass
print(soup.prettify())

Note that this code will only work in Python 3. If you need it to work in Python 2, see Nóra's answer below.

Here's a Python 2 version of unutbu's answer:

# tag.attrs.iteritems() below needs a dict, which is what bs4 provides
from bs4 import BeautifulSoup

REMOVE_ATTRIBUTES = ['lang','language','onmouseover']

doc = '''<html><head><title>Page title</title></head><body><p id="firstpara" align="center">This is <i>paragraph</i> <a onmouseout="">one</a>.<p id="secondpara" align="blah">This is <i>paragraph</i> <b>two</b>.</html>'''

soup = BeautifulSoup(doc, 'html.parser')

for tag in soup.recursiveChildGenerator():
    # NavigableString objects have no attrs, so skip them
    if hasattr(tag, 'attrs'):
        tag.attrs = {key:value for key,value in tag.attrs.iteritems()
                    if key not in REMOVE_ATTRIBUTES}

Just for the record: the problem here is that if you pass HTML attributes as keyword arguments, the keyword is the name of the attribute. So your code is searching for tags with an attribute named attribute, as the variable does not get expanded.

This is why

  1. hard-coding your attribute name worked[0]
  2. the code does not fail. The search just doesn't match any tags

To fix the problem, pass the attribute you are looking for as a dict:
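The same effect can be reproduced in plain Python, without BeautifulSoup. This is only a minimal sketch: received_keywords is a hypothetical helper written for illustration, not part of any library. It returns the keyword names exactly as a function sees them, which shows that the variable name is passed literally unless the mapping is unpacked with `**`.

```python
def received_keywords(**kwargs):
    """Return the keyword names exactly as the callee receives them."""
    return sorted(kwargs)


attribute = 'style'

# The keyword is taken literally; the variable's value is never consulted.
literal = received_keywords(attribute=True)

# Unpacking a dict uses the variable's value as the keyword name.
unpacked = received_keywords(**{attribute: True})

print(literal)   # ['attribute']
print(unpacked)  # ['style']
```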

for attribute in REMOVE_ATTRIBUTES:
    for tag in soup.find_all(attrs={attribute: True}):
        del tag[attribute]
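For anyone who wants to try the loop above end to end, here is a small self-contained sketch. It assumes bs4 is installed and uses the built-in 'html.parser'; the shortened attribute list and the sample markup are made up for the example.

```python
from bs4 import BeautifulSoup

# Shortened, illustrative list; use the full list from the question in practice.
REMOVE_ATTRIBUTES = ['align', 'onmouseout']

doc = '<p id="firstpara" align="center">This is <a onmouseout="hide()">one</a>.</p>'
soup = BeautifulSoup(doc, 'html.parser')

for attribute in REMOVE_ATTRIBUTES:
    # attrs={name: True} matches every tag that carries the attribute at all
    for tag in soup.find_all(attrs={attribute: True}):
        del tag[attribute]

result = str(soup)
print(result)
```

The id attribute is not in the list, so it survives, and so do the tags and their text.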

Hope this helps someone in the future, dtk

[0]: Although it needs to be find_all(style=True) in your example, without the quotes, because of SyntaxError: keyword can't be an expression

I use this method to remove a list of attributes; it is very compact:

attributes_to_del = ["style", "border", "rowspan", "colspan", "width", "height", 
                     "align", "valign", "color", "bgcolor", "cellspacing", 
                     "cellpadding", "onclick", "alt", "title"]
for attr_del in attributes_to_del: 
    [s.attrs.pop(attr_del) for s in soup.find_all() if attr_del in s.attrs]
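If walking the whole tree once per attribute is a concern, a single-pass variant is possible. This is a rough sketch, not the method from the snippet above: strip_attributes is a hypothetical helper name, the sample markup is invented, and it assumes bs4 with the built-in 'html.parser'.

```python
from bs4 import BeautifulSoup


def strip_attributes(soup, names):
    """Remove every attribute listed in names from all tags, in one pass."""
    unwanted = set(names)
    # find_all(True) yields every tag in the document (no text nodes)
    for tag in soup.find_all(True):
        tag.attrs = {key: value for key, value in tag.attrs.items()
                     if key not in unwanted}
    return soup


html = '<table border="1" width="100%"><tr><td align="left" id="cell">x</td></tr></table>'
soup = strip_attributes(BeautifulSoup(html, 'html.parser'),
                        ["border", "width", "align"])
cleaned = str(soup)
print(cleaned)
```

Using a set makes each membership check cheap, and each tag is visited only once regardless of how many attribute names are listed.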


I use this one:

if "align" in div.attrs:
    del div.attrs["align"]

or

if "align" in div.attrs:
    div.attrs.pop("align")

Thanks to https://stackoverflow.com/a/22497855/1907997
