Remove text from HTML files but keep the JavaScript and structure using Python
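There are plenty of ways to extract text from HTML files, but I want to do the opposite: remove the text while leaving the structure and the JavaScript code untouched.
For example, remove all of
while keeping
Is there an easy way to do this? Any help is greatly appreciated. Cheers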
I would go with BeautifulSoup:
from copy import copy

from bs4 import BeautifulSoup
from bs4.element import NavigableString


def strip_content(in_tag):
    tag = copy(in_tag)  # remove this line if you don't care about your input
    if tag.name == 'script':
        # Do not mess with scripts
        return tag
    # strip content from all children (text nodes are dropped here)
    children = [strip_content(child) for child in tag.children
                if not isinstance(child, NavigableString)]
    # remove everything from the tag
    tag.clear()
    for child in children:
        # Add back stripped children
        tag.append(child)
    return tag


def test(filename):
    # Name the parser explicitly and close the file when done
    with open(filename) as f:
        soup = BeautifulSoup(f, "html.parser")
    cleaned_soup = strip_content(soup)
    print(cleaned_soup.prettify())


if __name__ == "__main__":
    test("myfile.html")
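
If installing BeautifulSoup is not an option, roughly the same idea can be sketched with the standard library's html.parser. This is not the approach from the answer above, just a minimal alternative under some assumptions: the names TextStripper and strip_text are made up for illustration, only the contents of script tags are kept, comments are dropped along with the text, and end tags are re-emitted in lowercase rather than copied verbatim.

```python
from html.parser import HTMLParser


class TextStripper(HTMLParser):
    """Re-emit tags as they are parsed, dropping text outside kept tags."""

    def __init__(self, keep=("script",)):
        super().__init__(convert_charrefs=True)
        self.keep = set(keep)
        self.inside_kept = False
        self.out = []

    def handle_starttag(self, tag, attrs):
        # get_starttag_text() returns the start tag exactly as written
        self.out.append(self.get_starttag_text())
        if tag in self.keep:
            self.inside_kept = True

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> have no content to track
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")
        if tag in self.keep:
            self.inside_kept = False

    def handle_data(self, data):
        # Text is only kept while inside one of the kept tags
        if self.inside_kept:
            self.out.append(data)

    def handle_decl(self, decl):
        # Preserve declarations such as <!DOCTYPE html>
        self.out.append(f"<!{decl}>")


def strip_text(html, keep=("script",)):
    parser = TextStripper(keep)
    parser.feed(html)
    parser.close()  # flush anything still buffered
    return "".join(parser.out)


if __name__ == "__main__":
    sample = ('<!DOCTYPE html><html><body><p class="a">Hello <b>world</b></p>'
              '<script>var x = 1;</script></body></html>')
    print(strip_text(sample))
```

Passing keep=("script", "style") would also preserve inline stylesheets, if those count as structure for your purposes.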