Remove text from an HTML file but keep the JavaScript and structure, using Python

Question

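There are many ways to extract the text from an HTML file, but I want to do the opposite: remove the text while leaving the structure and the JavaScript code untouched.

For example, remove all of

while keeping

Is there an easy way to do this? Any help is greatly appreciated. Cheers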

Answer 1

I would go with BeautifulSoup:

from copy import copy

from bs4 import BeautifulSoup
from bs4.element import NavigableString


def strip_content(in_tag):
    """Return a copy of in_tag with all text removed; <script> tags are kept intact."""
    tag = copy(in_tag)  # remove this line if you don't care about your input
    if tag.name == 'script':
        # Do not mess with scripts
        return tag
    # Strip content from all child tags; text nodes (NavigableString) are dropped here
    children = [strip_content(child) for child in tag.children
                if not isinstance(child, NavigableString)]
    # Remove everything from the tag
    tag.clear()
    for child in children:
        # Add back the stripped children
        tag.append(child)
    return tag


def test(filename):
    # Name the parser explicitly and close the file when done
    with open(filename) as f:
        soup = BeautifulSoup(f, "html.parser")
    cleaned_soup = strip_content(soup)
    print(cleaned_soup.prettify())


if __name__ == "__main__":
    test("myfile.html")
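If copying the tree is not needed, the same idea can probably be written without recursion by pulling the text nodes out of the parsed document directly. The following is only a sketch under a few assumptions: the function name strip_text_in_place, the keep parameter, and the sample HTML string are made up for illustration, and it relies on find_all(string=True), which needs a reasonably recent bs4 (older releases spell the argument text=True).

```python
from bs4 import BeautifulSoup


def strip_text_in_place(soup, keep=("script",)):
    """Remove every text node from soup, except those inside tags named in keep.

    Hypothetical helper: modifies the parsed tree in place and returns it.
    """
    # Collect the text nodes first so the tree is not modified while it is searched
    for node in list(soup.find_all(string=True)):
        if node.parent is not None and node.parent.name in keep:
            # Leave the contents of <script> (or other kept tags) alone
            continue
        node.extract()
    return soup


# Made-up sample input, not taken from the question
html = "<html><body><p>Hello <b>world</b></p><script>var x = 1;</script></body></html>"
soup = strip_text_in_place(BeautifulSoup(html, "html.parser"))
result = str(soup)
print(result)
```

Passing keep=("script", "style") would also preserve inline CSS, which the recursive version above would strip.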

Solution 1
3 Accepted 2015-07-22 12:04:25
