Smart prettify html with BeautifulSoup
Is there any way to control the depth of expansion? My HTML sometimes contains CSS, and prettify adds a line break after every tag. I would like to go from:
<html><body><h1>hello world</h1></body></html>
to:
<html>
<body><h1>hello world</h1></body>
</html>
from bs4 import BeautifulSoup
INPUT_FILE = "html_unformatted.txt"
OUTPUT_FILE = "index.html"
unicode_data = open(INPUT_FILE, "r", encoding='unicode_escape').read()
data = unicode_data.encode('iso-8859-1').decode('utf-8')
soup = BeautifulSoup(data, features="html.parser")
pretty_html = soup.prettify()
with open(OUTPUT_FILE, "w") as f:
    f.write(pretty_html)
print(f"Wrote to {OUTPUT_FILE}")
What I get is:
<html>
 <body>
  <h1>
   hello world
  </h1>
 </body>
</html>
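Unfortunately, according to the BeautifulSoup documentation, prettify has no option for controlling how deep the expansion goes. However, you can wrap soup.prettify in another function and replace the "pretty" text of one element with its single-line form. That is what prettify_except below does: it prettifies everything except the content of the tag named tag_name: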
from bs4 import BeautifulSoup
import re
html = "<html><body><h1>hello world</h1></body></html>"
soup = BeautifulSoup(html, features="html.parser")
print(soup.prettify())
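def prettify_except(soup_obj: BeautifulSoup, tag_name: str) -> str:
    # Match the whole element in the prettified text; this assumes the tag has no attributes.
    regex = re.compile(r"<{0}>.*</{0}>".format(tag_name), re.DOTALL)
    # Unprettified, single-line form of the first <tag_name> element.
    replacing_txt = str(getattr(soup_obj, tag_name))
    # A callable replacement keeps backslashes in the HTML from being read as regex escapes.
    return regex.sub(lambda _: replacing_txt, soup_obj.prettify())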
print(prettify_except(soup, 'body'))
# original prettified
# <html>
#  <body>
#   <h1>
#    hello world
#   </h1>
#  </body>
# </html>
# prettified, except body
# <html>
#  <body><h1>hello world</h1></body>
# </html>
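If what you want is a real depth limit rather than an exception for one named tag, a small recursive walk over the tree might be enough. The following is only a sketch, not part of the BeautifulSoup API: prettify_to_depth and _lines are hypothetical helper names, and it assumes that str(tag) is exactly the opening tag, then tag.decode_contents(), then the closing tag (which should hold for ordinary markup parsed with html.parser, but not for namespaced tags).

```python
from bs4 import BeautifulSoup
from bs4.element import Tag


def _lines(node, max_depth, depth, indent):
    """Return the output lines for one node (hypothetical helper)."""
    pad = indent * depth
    if isinstance(node, Tag) and depth < max_depth and node.contents:
        whole = str(node)
        closing = "</{0}>".format(node.name)
        inner = node.decode_contents()
        # Assumption: whole == opening tag + inner + closing tag.
        opening = whole[: len(whole) - len(inner) - len(closing)]
        lines = [pad + opening]
        for child in node.children:
            lines.extend(_lines(child, max_depth, depth + 1, indent))
        lines.append(pad + closing)
        return lines
    # Tags at or below the limit, empty tags, and text nodes stay on one line.
    text = str(node) if isinstance(node, Tag) else node.output_ready()
    text = text.strip()
    return [pad + text] if text else []


def prettify_to_depth(soup_obj, max_depth, indent=" "):
    """Expand tags on their own lines only down to max_depth levels."""
    lines = []
    for child in soup_obj.children:
        lines.extend(_lines(child, max_depth, 0, indent))
    return "\n".join(lines)


html = "<html><body><h1>hello world</h1></body></html>"
soup = BeautifulSoup(html, features="html.parser")
print(prettify_to_depth(soup, 1))
# <html>
#  <body><h1>hello world</h1></body>
# </html>
```

Like prettify itself, this strips whitespace around text nodes, so it is meant for reading the markup, not for output where whitespace is significant (for example inside pre).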