Preserve &nbsp; in Beautiful Soup object

Question

I have my sample.htm file as follows:

<html><head>
<title>hello</title>
</head>
<body>
<p>&nbsp; Hello! he said. &nbsp; !</p>
</body>
</html>

I have my Python code as follows:

from bs4 import BeautifulSoup

# Parse the input file
with open('sample.htm', 'r', encoding='utf8') as f:
    contents = f.read()
    soup = BeautifulSoup(contents, 'html.parser')

# Write the parsed document back out
with open("sample-output.htm", "w", encoding='utf-8') as file:
    file.write(str(soup))

This reads sample.htm and writes the result to another file, sample-output.htm.

The output of the above:

<html><head>
<title>hello</title>
</head>
<body>
<p>  Hello! he said.   !</p>
</body>
</html>

How can I preserve the &nbsp; entities when writing to the file? In the output they have been replaced by literal non-breaking space characters.

Answer 1

You could always use a regex on the serialized output:

import re
from bs4 import BeautifulSoup

text = '<p>&nbsp; Hello! he said. &nbsp; !</p>'
soup = BeautifulSoup(text, 'html.parser')
# text_str = str(soup)
# Beautiful Soup turns &nbsp; into the character U+00A0; turn it back
text_str = re.sub(r"\xa0", "&nbsp;", str(soup))

I know this happens after parsing rather than inside Beautiful Soup, but I hope it offers a different perspective on a solution.

Answer 2

You could just use str.replace:

>>> text_str.replace("\xa0", "&nbsp;")
'<p>&nbsp; Hello! he said. &nbsp; !</p>'

In your code, you could apply it when writing the file:

with open("sample-output.htm", "w", encoding='utf-8') as file:
    file.write(str(soup).replace("\xa0", "&nbsp;"))
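To check that the replacement round-trips, here is a small self-contained sketch. It does not need Beautiful Soup: the string serialized stands in for what str(soup) returns for the sample paragraph, and restore_nbsp is a hypothetical helper name, not part of any library.

```python
NBSP = "\xa0"  # U+00A0, the character that &nbsp; is parsed into


def restore_nbsp(markup: str) -> str:
    """Turn every literal no-break space back into the &nbsp; entity."""
    return markup.replace(NBSP, "&nbsp;")


# Stand-in for str(soup): the entities have already become U+00A0 characters.
serialized = "<p>\xa0 Hello! he said. \xa0 !</p>"
restored = restore_nbsp(serialized)
print(restored)
```

Note that this rewrites every U+00A0 in the output, including any that were typed literally in the source file or that appear inside attribute values.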

Answer 3

See the "Output formatters" section of the Beautiful Soup documentation:

If you give Beautiful Soup a document that contains HTML entities like “&lquot;”, they'll be converted to Unicode characters
…
If you then convert the document to a string, the Unicode characters will be encoded as UTF-8. You won't get the HTML entities back
…
You can change this behavior by providing a value for the formatter argument to prettify(), encode(), or decode()
…
If you pass in formatter="html", Beautiful Soup will convert Unicode characters to HTML entities whenever possible:

soup_string = soup.prettify(formatter="html")
print(soup_string)

<html>
 <head>
  <title>
   hello
  </title>
 </head>
 <body>
  <p>
   &nbsp; Hello! he said. &nbsp; !
  </p>
 </body>
</html>

print(type(soup_string)) # for the sake of completeness

 <class 'str'>

Another way (without prettify()):

print(soup.encode(formatter="html").decode())

<html><head>
<title>hello</title>
</head>
<body>
<p>&nbsp; Hello! he said. &nbsp; !</p>
</body>
</html>
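One thing to be aware of: formatter="html" converts every character that has a named entity, not only the non-breaking space. The sketch below compares the default output, formatter="html", and a function passed as the formatter that restores only &nbsp;. The sample string and the name nbsp_only are made up for illustration, and the sketch assumes a Beautiful Soup 4 version that accepts a callable as the formatter argument.

```python
from bs4 import BeautifulSoup
from bs4.dammit import EntitySubstitution

html = "<p>&nbsp; Hello! he said. &nbsp; caf\u00e9</p>"
soup = BeautifulSoup(html, "html.parser")

# Default ("minimal") formatter: &nbsp; comes back as the character U+00A0.
default_out = str(soup)

# formatter="html": named entities wherever one exists, so é changes too.
entity_out = soup.decode(formatter="html")


def nbsp_only(text):
    """Hypothetical formatter: minimal escaping, then restore only &nbsp;."""
    escaped = EntitySubstitution.substitute_xml(text)  # handles &, < and >
    return escaped.replace("\xa0", "&nbsp;")


# The function is called for each string and attribute value in the tree.
nbsp_out = soup.decode(formatter=nbsp_only)

print(entity_out)
print(nbsp_out)
```

To write the result to a file as in the question, pass either entity_out or nbsp_out to file.write() in place of str(soup).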

3 answers

solution1
0 2022-03-04 17:38:40

solution2
0 2022-03-04 17:43:40

solution3
0 ACCEPTED 2022-03-04 19:43:58
