
HTML entity handling in Python 3/BeautifulSoup on Windows

I'm having trouble handling HTML that contains escaped Unicode characters (in the Chinese range) in Python 3 with BeautifulSoup on Windows. BeautifulSoup seems to work correctly until I try to print an extracted tag or write it out to a file. I have my default encoding set to utf-8, yet the cp1252 codec seems to be getting selected...

To reproduce:

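from bs4 import BeautifulSoup

# "&#38577;" is the numeric character reference for U+96B1 (隱)
soup = BeautifulSoup("&#38577;", "html.parser")

f = open("out.html", "w")
f.write(soup.text)
f.close()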

Stack trace attached.

Traceback (most recent call last):
  File "scrape.py", line 143, in <module>
    test_uni()
  File "scrape.py", line 126, in test_uni
    f.write(soup.text)
  File "c:\venv\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u96b1' in position 0: character maps to <undefined>

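You are writing a non-ASCII (Unicode) string to a file opened in text mode without an explicit encoding. In Python 3, open() then falls back to the locale's preferred encoding, which on your Windows setup is cp1252 (see the cp1252.py frame in the traceback), and cp1252 has no mapping for Chinese characters. So this does depend on your Windows environment: whatever reports utf-8 as your default (for example sys.getdefaultencoding()) is not what open() consults.

Encoding the text as utf-8 when writing should work, and utf-8 is fine with Chinese characters. In Python 3 you cannot write bytes to a file opened in text mode, so open the file in binary mode when you encode by hand:

f = open("out.html", "wb")
f.write(soup.text.encode('utf-8'))

Alternatively, keep text mode, pass encoding="utf-8" to open(), and write soup.text directly.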
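For reference, here is a minimal sketch of both ways to get UTF-8 on disk regardless of the platform default. It makes a few assumptions that are not in the question: html.unescape from the standard library stands in for soup.text so the snippet runs without BeautifulSoup, and out_path is a hypothetical location in the temp directory.

```python
import html
import locale
import os
import tempfile

# Stand-in for soup.text (assumption): decode the same numeric entity
# with the standard library instead of BeautifulSoup.
text = html.unescape("&#38577;")

# The encoding open() usually falls back to in text mode when none is
# given; on a Western-locale Windows install this is typically cp1252.
print("locale preferred encoding:", locale.getpreferredencoding(False))

# Hypothetical output location, chosen only for this sketch.
out_path = os.path.join(tempfile.gettempdir(), "out.html")

# Option 1: text mode with an explicit encoding; write the str directly.
with open(out_path, "w", encoding="utf-8") as f:
    f.write(text)

# Option 2: binary mode; encode the str to bytes yourself.
with open(out_path, "wb") as f:
    f.write(text.encode("utf-8"))

# Read the file back with the same encoding to confirm the round trip.
with open(out_path, encoding="utf-8") as f:
    round_trip = f.read()
```

Either option avoids the cp1252 codec entirely. The same mismatch can show up when printing to a Windows console, since sys.stdout has its own encoding.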
