Smart prettifying of HTML with BeautifulSoup

Question

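Is there a way to control how deep prettify expands the markup? My HTML sometimes contains CSS, and prettify puts every tag on its own line. I would like to go from: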

<html><body><h1>hello world</h1></body></html>

to:

<html>
 <body><h1>hello world</h1></body>
</html>

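from bs4 import BeautifulSoup

INPUT_FILE = "html_unformatted.txt"
OUTPUT_FILE = "index.html"

# Decode backslash escapes in the file, then re-encode as Latin-1 and
# decode as UTF-8 to recover multi-byte characters.
with open(INPUT_FILE, "r", encoding="unicode_escape") as f:
    unicode_data = f.read()
data = unicode_data.encode("iso-8859-1").decode("utf-8")
soup = BeautifulSoup(data, features="html.parser")
pretty_html = soup.prettify()

# Write UTF-8 explicitly so the result does not depend on the platform default.
with open(OUTPUT_FILE, "w", encoding="utf-8") as f:
    f.write(pretty_html)
    print(f"Wrote to {OUTPUT_FILE}")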

Instead, I get:


<html>
 <body>
  <h1>
   hello world
  </h1>
 </body>
</html>

Answer 1

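Unfortunately, according to the BeautifulSoup documentation, prettify offers no option for this kind of customization. However, you can wrap soup.prettify in another function and replace the prettified text of a chosen tag with its single-line form.

That is what prettify_except below does: it prettifies everything except the contents of the tag named tag_name: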

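from bs4 import BeautifulSoup
import re

html = "<html><body><h1>hello world</h1></body></html>"
soup = BeautifulSoup(html, features="html.parser")

print(soup.prettify())

def prettify_except(soup_obj: BeautifulSoup, tag_name: str) -> str:
    # Match from the first <tag_name ...> to the last </tag_name> in the
    # prettified text; [^>]* also allows attributes on the opening tag.
    name = re.escape(tag_name)
    regex = re.compile(r"<{0}\b[^>]*>.*</{0}>".format(name), re.DOTALL)
    # Compact, single-line rendering of the first tag with that name.
    replacing_txt = str(soup_obj.find(tag_name))
    # A function as the replacement keeps backslashes in the HTML from
    # being interpreted as regex escapes.
    return regex.sub(lambda _: replacing_txt, soup_obj.prettify(), count=1)

print(prettify_except(soup, 'body'))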

# original prettified

# <html>
#  <body>
#   <h1>
#    hello world
#   </h1>
#  </body>
# </html>


# prettified, except body

# <html>
#  <body><h1>hello world</h1></body>
# </html>
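
If the regex substitution feels fragile, another option is to walk the parsed tree yourself and stop expanding at a chosen depth. The following is only a sketch under a few assumptions, not something bs4 provides: prettify_to_depth and _opening_tag are hypothetical helper names, max_depth and indent are made-up parameters, and the document is assumed to be parsed with html.parser. Like prettify, it strips whitespace around text nodes, so it is not suitable for whitespace-sensitive content such as pre blocks.

```python
import copy

from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag


def _opening_tag(tag: Tag) -> str:
    """Hypothetical helper: return only the opening tag, e.g. '<div class="a">'."""
    shell = copy.copy(tag)      # work on a copy so the caller's tree is untouched
    shell.clear()               # drop the children, keep the name and attributes
    closing = f"</{tag.name}>"
    return str(shell)[: -len(closing)]


def prettify_to_depth(node: Tag, max_depth: int, indent: str = " ", level: int = 0) -> str:
    """Hypothetical helper: expand tags above max_depth, keep deeper tags on one line."""
    lines = []
    pad = indent * level
    for child in node.children:
        if isinstance(child, Tag):
            if level < max_depth and child.contents:
                # Expanded form: opening tag, indented children, closing tag.
                lines.append(pad + _opening_tag(child))
                inner = prettify_to_depth(child, max_depth, indent, level + 1)
                if inner:
                    lines.append(inner)
                lines.append(f"{pad}</{child.name}>")
            else:
                # At or below the depth limit (or childless): compact form.
                lines.append(pad + str(child))
        elif isinstance(child, NavigableString):
            # output_ready() re-applies entity escaping and comment delimiters.
            text = child.output_ready().strip()
            if text:
                lines.append(pad + text)
    return "\n".join(lines)


soup = BeautifulSoup("<html><body><h1>hello world</h1></body></html>", "html.parser")
print(prettify_to_depth(soup, max_depth=1))

# <html>
#  <body><h1>hello world</h1></body>
# </html>
```

With max_depth=0 the markup stays on a single line, and with max_depth=2 the body is expanded as well while h1 stays compact. Copying each expanded tag just to render its opening tag is wasteful on large documents, so treat this as a starting point rather than a tuned implementation.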

Solution 1
3 2021-10-15 18:14:22
