Unicode Parsing Error with BeautifulSoup

Question

The following code:

from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq

uClient = uReq('http://www.google.com')
page_html = uClient.read()

uClient.close()

page_soup = soup(page_html.decode('utf-8', 'ignore'), 'lxml')
print(page_soup.find_all('p'))

...produces the following error:

C:\>python ws1.py
Traceback (most recent call last):
  File "ws1.py", line 10, in <module>
    print(page_soup.find_all('p'))
  File "C:\Python34\lib\encodings\cp437.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\xa9' in position 40: character maps to <undefined>

I have searched in vain for a solution. Every post I have read suggests using a specific encoding, but none of them has fixed the problem.

Any help would be appreciated.

Thank you.

Answer 1

You're trying to print a Unicode string that contains characters that can't be represented in the encoding used by your console.
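The failure can be reproduced without a console or BeautifulSoup at all. This is a minimal sketch, assuming the cp437 code page named in the traceback; the sample string is made up for illustration and only needs to contain the copyright sign (U+00A9) from the error message.

```python
# Assumption: cp437 is the console code page, as the traceback suggests.
# The sample text is hypothetical; it only needs to contain U+00A9.
text = 'Copyright \xa9 Google'

try:
    # This is the same conversion print() performs when writing to the console.
    text.encode('cp437')
    error_message = None
except UnicodeEncodeError as exc:
    error_message = str(exc)

# Shows the same "'charmap' codec can't encode character" message.
print(error_message)
```

Decoding the page as UTF-8 is not what fails here. The exception is raised on output, when the already decoded string is converted to the console's code page.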

It appears you're using the Windows command line with Python 3.4 (going by the paths in your traceback), which means your problem could be solved simply by switching to Python 3.6 or later. From that version on, Python bypasses the console encoding altogether and sends Unicode straight to the Windows console.

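If that's not possible, you can encode the string yourself and specify that characters the console encoding can't represent should be replaced with an escape sequence. Then you must decode the result again, so that print receives a string rather than a bytes object and displays the text instead of a b'...' representation.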

import sys
# find_all() returns a ResultSet (a list), so convert it to str before encoding
bstr = str(page_soup.find_all('p')).encode(sys.stdout.encoding, errors='backslashreplace')
print(bstr.decode(sys.stdout.encoding))
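The encode/decode round trip can also be wrapped in a small helper so it is easy to reuse. This is only a sketch: the name `console_safe` and the sample string are hypothetical, and cp437 is passed explicitly here just so the result is the same on any machine.

```python
import sys


def console_safe(text, encoding=None):
    """Return text with characters the encoding lacks replaced by escapes."""
    # Default to the console encoding; fall back to ASCII if stdout has none.
    encoding = encoding or getattr(sys.stdout, 'encoding', None) or 'ascii'
    # Encode with backslashreplace, then decode so print() gets a str, not bytes.
    return text.encode(encoding, errors='backslashreplace').decode(encoding)


# cp437 has no copyright sign, so it is replaced by the escape \xa9.
sample = console_safe('Copyright \xa9 2017', 'cp437')
print(sample)
```

With the code from the question, this would be called as print(console_safe(str(page_soup.find_all('p')))), leaving the encoding argument out so that the actual console encoding is used.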

solution1
2 ACCEPTED 2017-08-08 19:25:48
