python UnicodeEncodeError > How can I simply remove troubling unicode characters?

Question

Here's what I did:

>>> soup = BeautifulSoup (html)
>>> soup
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xae' in position 96953: ordinal not in range(128)
>>> 
>>> soup.find('div')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xae' in position 11035: ordinal not in range(128)
>>> 
>>> soup.find('span')
<span id="navLogoPrimary" class="navSprite"><span>amazon.com</span></span>
>>>

How can I simply remove the troublesome Unicode characters from the HTML?
Or is there a cleaner solution?

Answer 1

Try this: soup = BeautifulSoup(html.decode('utf-8', 'ignore'))
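
To illustrate what the 'ignore' error handler does, here is a minimal sketch. The sample bytes and the names raw, text and ascii_only are made up for illustration, and it assumes the page is mostly UTF-8 with an occasional stray byte. Note that 'ignore' only drops bytes that are not valid UTF-8; valid non-ASCII characters are kept, so a second step is needed if you really want them gone.

```python
# Hypothetical sample: valid UTF-8 for the registered sign (\xc2\xae)
# followed by a stray byte (\xff) that is not valid UTF-8.
raw = b'amazon\xc2\xae \xffcom'

# 'ignore' silently drops bytes that cannot be decoded as UTF-8.
text = raw.decode('utf-8', 'ignore')

# To strip every non-ASCII character as well, round-trip through ASCII.
ascii_only = text.encode('ascii', 'ignore').decode('ascii')
```

Dropping characters loses information, so this is probably best kept for cases where the text is only needed for rough matching or logging.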

Answer 2

The error you see occurs because repr(soup) tries to mix Unicode and bytestrings, which frequently leads to errors.

Compare:

>>> u'1' + '©'
Traceback (most recent call last):
  File "<input>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)

And:

>>> u'1' + u'©'
u'1\xa9'
>>> '1' + u'©'
u'1\xa9'
>>> '1' + '©'
'1\xc2\xa9'

Here's an example for classes:

>>> class A:
...     def __repr__(self):
...         return u'copyright ©'.encode('utf-8')
... 
>>> A()
copyright ©
>>> class B:
...     def __repr__(self):
...         return u'copyright ©'
... 
>>> B()
Traceback (most recent call last):
  File "<input>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa9' in position 10: ordinal not in range(128)
>>> class C:
...     def __repr__(self):
...         return repr(A()) + repr(B())
...
>>> C()
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "<input>", line 3, in __repr__
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa9' in position 10: ordinal not in range(128)

A similar thing happens with BeautifulSoup:

>>> html = """<p>©"""
>>> soup = BeautifulSoup(html)
>>> repr(soup)
Traceback (most recent call last):
  File "<input>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa9' in position 3: ordinal not in range(128)

To work around it:

>>> unicode(soup)
u'<p>\xa9</p>'
>>> str(soup)
'<p>\xc2\xa9</p>'
>>> soup.encode('utf-8')
'<p>\xc2\xa9</p>'

Answer 3

First of all, the "troubling" Unicode characters could be letters in some language. But assuming you don't have to worry about non-English characters, you can use a Python library to convert Unicode to ASCII. Check out the answer to this question: How do I convert a file's format from Unicode to ASCII using Python?

The accepted answer there seems like a good solution (that I didn't know about beforehand).
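
As a rough sketch of one common standard-library way to do this (not necessarily the exact method from the linked answer), unicodedata.normalize can split accented letters into a base letter plus combining marks before the non-ASCII remainder is discarded. The helper name to_ascii is hypothetical.

```python
import unicodedata


def to_ascii(text):
    """Return text with accents stripped and other non-ASCII characters dropped."""
    # NFKD splits accented letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize('NFKD', text)
    # Encoding with 'ignore' then discards everything outside ASCII.
    return decomposed.encode('ascii', 'ignore').decode('ascii')


# Hypothetical input: an accented letter and the registered sign.
cleaned = to_ascii(u'caf\xe9 \xae')
```

Characters with no ASCII base form, such as the registered sign, are simply removed, so this is lossy by design.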

Answer 4

I had the same problem and spent hours on it. Notice that the error occurs whenever the interpreter has to display content: the interpreter tries to convert the text to ASCII, which fails. Take a look at the top answer here:

UnicodeEncodeError with BeautifulSoup 3.1.0.1 and Python 2.5.2
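
The ASCII conversion described above can be reproduced without BeautifulSoup. The following is a minimal sketch with a made-up markup string; it assumes the output destination accepts UTF-8 bytes or, alternatively, plain ASCII with HTML character references.

```python
text = u'<p>\xa9</p>'  # hypothetical markup containing the copyright sign

# An implicit conversion to ASCII fails in the same way as in the traceback.
try:
    text.encode('ascii')
    failed = False
except UnicodeEncodeError:
    failed = True

# Encoding explicitly avoids the implicit ASCII step.
utf8_bytes = text.encode('utf-8')

# For HTML, 'xmlcharrefreplace' keeps the character as a numeric reference
# instead of dropping it.
escaped = text.encode('ascii', 'xmlcharrefreplace')
```

The 'xmlcharrefreplace' variant keeps the output pure ASCII without losing the character, which may be a cleaner option than removing it.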

4 answers

solution1
10 ACCEPTED 2011-03-08 18:46:28

solution2
2 2011-03-09 12:39:19

solution3
1 2011-03-08 18:13:36

solution4
0 2012-01-02 22:21:52