Unicode Parsing Error with BeautifulSoup

The following code:

from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq

uClient = uReq('http://www.google.com')
page_html = uClient.read()

uClient.close()

page_soup = soup(page_html.decode('utf-8', 'ignore'), 'lxml')
print(page_soup.find_all('p'))

...produces the following error:

C:\>python ws1.py
Traceback (most recent call last):
  File "ws1.py", line 10, in <module>
    print(page_soup.find_all('p'))
  File "C:\Python34\lib\encodings\cp437.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\xa9' in position 40: character maps to <undefined>

I have searched in vain for a solution. Every post I have read suggests using a specific encoding, and none of them has fixed the problem.

Any help would be appreciated.

Thank you.

You're trying to print a Unicode string that contains characters that can't be represented in the encoding used by your console.
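To see the cause in isolation, here is a minimal sketch that reproduces the failure without any network access. It assumes the console code page is cp437, as shown in the traceback. The sample string is made up for illustration.

```python
# The copyright sign (U+00A9) has no slot in the cp437 code page,
# so encoding it with the default 'strict' error handler fails.
text = 'Copyright \xa9 Google'  # made-up sample text

try:
    text.encode('cp437')
    failed = False
except UnicodeEncodeError as exc:
    failed = True
    print('cannot encode:', exc.reason)

# With an error handler, the character is escaped instead of raising.
escaped = text.encode('cp437', errors='backslashreplace')
print(escaped)
```

print() performs the same strict encoding step when it writes to a console that uses cp437, which is why the error comes from the print line rather than from BeautifulSoup.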

It appears you're using the Windows command line, which means your problem could be solved simply by switching to Python 3.6 or later. From that version on, Python bypasses the console code page altogether and sends Unicode straight to the Windows console.

If that's not possible, you can encode the string yourself and specify that unprintable characters should be replaced with an escape sequence. Then you must decode it again so that print will work properly.

import sys
text = str(page_soup.find_all('p'))  # find_all() returns a ResultSet, which has no encode()
bstr = text.encode(sys.stdout.encoding, errors='backslashreplace')
print(bstr.decode(sys.stdout.encoding))
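If you need this in more than one place, the encode/decode round trip can be wrapped in a small helper. The sketch below is only an illustration: the name to_console_text and its encoding parameter are hypothetical, and the fallback to 'utf-8' is an assumption for the case where sys.stdout.encoding is None (for example when output is redirected in some environments).

```python
import sys


def to_console_text(value, encoding=None):
    """Return str(value) with characters the encoding lacks escaped.

    Hypothetical helper: 'encoding' defaults to the console encoding,
    falling back to 'utf-8' if sys.stdout.encoding is not set.
    """
    encoding = encoding or sys.stdout.encoding or 'utf-8'
    # Unencodable characters become backslash escapes such as \xa9.
    raw = str(value).encode(encoding, errors='backslashreplace')
    # Decode again so that print() receives a str, not bytes.
    return raw.decode(encoding)


# Passing the encoding explicitly makes the result independent of the console.
print(to_console_text('<p>\xa9 2017</p>', encoding='cp437'))
```

With the code from the question, the call would be print(to_console_text(page_soup.find_all('p'))).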
