Smart prettify html with BeautifulSoup

Is there any way I can control the depth of unwrapping? My HTML sometimes contains CSS, and prettify puts every tag on its own line. I would like to go from:

<html><body><h1>hello world</h1></body></html>

to:

<html>
 <body><h1>hello world</h1></body>
</html>

My code:

from bs4 import BeautifulSoup

INPUT_FILE = "html_unformatted.txt"
OUTPUT_FILE = "index.html"

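# Decode escape sequences (e.g. \n, \xNN) that are stored literally in the input file.
with open(INPUT_FILE, "r", encoding='unicode_escape') as f:
    unicode_data = f.read()
# Re-interpret the resulting code points as UTF-8 bytes.
data = unicode_data.encode('iso-8859-1').decode('utf-8')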
soup = BeautifulSoup(data, features="html.parser")
pretty_html = soup.prettify()

with open(OUTPUT_FILE, "w") as f:
    f.write(pretty_html)
    print(f"Wrote to {OUTPUT_FILE}")

But what I get is:

<html>
 <body>
  <h1>
   hello world
  </h1>
 </body>
</html>

Unfortunately, according to the BeautifulSoup docs, prettify cannot be customised to do this. However, you can wrap soup.prettify in another function and replace the prettified text of a chosen tag with its one-line form.

That's what prettify_except below does, i.e. it prettifies everything except the content of the tag named tag_name:

from bs4 import BeautifulSoup
import re

html = "<html><body><h1>hello world</h1></body></html>"
soup = BeautifulSoup(html, features="html.parser")

print(soup.prettify())

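def prettify_except(soup_obj: BeautifulSoup, tag_name: str) -> str:
    # Raw string avoids the invalid "\/" escape; [^>]* also allows attributes on the opening tag.
    regex_string = r"<{0}\b[^>]*>.*</{0}>".format(re.escape(tag_name))
    regex = re.compile(regex_string, re.DOTALL)
    # One-line rendering of the first tag with that name.
    replacing_txt = str(soup_obj.find(tag_name))
    # A callable replacement keeps backslashes in the HTML from being read as escapes.
    return regex.sub(lambda _: replacing_txt, soup_obj.prettify())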

print(prettify_except(soup, 'body'))

# original prettified

# <html>
#  <body>
#   <h1>
#    hello world
#   </h1>
#  </body>
# </html>


# prettified, except body

# <html>
#  <body><h1>hello world</h1></body>
# </html>
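
If you want to control the depth itself rather than name a single tag, one option is to skip prettify and walk the tree yourself. What follows is only a sketch: prettify_to_depth and its max_depth parameter are names made up for this example, not part of BeautifulSoup. It assumes the html.parser backend used above, and it strips whitespace around text nodes, so it is not suitable for whitespace-sensitive content such as <pre>.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag


def prettify_to_depth(node, max_depth: int, depth: int = 0) -> str:
    """Hypothetical helper: one tag per line down to max_depth, inline below that."""
    indent = " " * depth
    lines = []
    for child in node.children:
        if isinstance(child, Tag):
            whole = str(child)  # compact, one-line rendering of the tag
            if depth >= max_depth or not child.contents:
                # Deep enough (or nothing inside): keep the markup on one line.
                lines.append(indent + whole)
                continue
            inner = child.decode_contents()
            closing = f"</{child.name}>"
            # str(child) is opening tag + inner markup + closing tag,
            # so slicing off the last two parts leaves the opening tag.
            opening = whole[: len(whole) - len(inner) - len(closing)]
            lines.append(indent + opening)
            body = prettify_to_depth(child, max_depth, depth + 1)
            if body:
                lines.append(body)
            lines.append(indent + closing)
        else:
            # Text, comments, doctype: emit them escaped, skip pure whitespace.
            text = child.output_ready().strip()
            if text:
                lines.append(indent + text)
    return "\n".join(lines)


html = "<html><body><h1>hello world</h1></body></html>"
soup = BeautifulSoup(html, features="html.parser")
print(prettify_to_depth(soup, max_depth=1))

# <html>
#  <body><h1>hello world</h1></body>
# </html>
```

With max_depth=0 the document comes back on a single line, and with max_depth=2 the h1 gets its own line while its text stays inline.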
