Python UnicodeEncodeError: how can I simply remove troubling Unicode characters?

Question

Here's what I did:

>>> soup = BeautifulSoup (html)
>>> soup
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xae' in position 96953: ordinal not in range(128)
>>> 
>>> soup.find('div')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xae' in position 11035: ordinal not in range(128)
>>> 
>>> soup.find('span')
<span id="navLogoPrimary" class="navSprite"><span>amazon.com</span></span>
>>>

How can I simply remove the troubling Unicode characters from the HTML?
Or is there a cleaner solution?

Answer 1

Try this: soup = BeautifulSoup(html.decode('utf-8', 'ignore'))
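A minimal sketch of what that one-liner does, assuming `html` is a UTF-8 byte string (the sample bytes below are invented for illustration). Note that `'ignore'` only drops bytes that are not valid UTF-8; a valid character such as u'\xae' survives the decode, so removing it outright takes a second step like the ASCII round trip shown here.

```python
# Hypothetical sample input: UTF-8 bytes with one invalid byte (0xff) appended.
html = u'<p>Amazon\xae</p>'.encode('utf-8') + b'\xff'

# 'ignore' silently drops undecodable bytes; valid non-ASCII characters stay.
text = html.decode('utf-8', 'ignore')

# To really remove every non-ASCII character, encode to ASCII with 'ignore'
# and decode back to a Unicode string.
ascii_only = text.encode('ascii', 'ignore').decode('ascii')

print(repr(text))
print(repr(ascii_only))
```

The snippet uses only string methods, so it should behave the same on Python 2 and Python 3.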

Answer 2

The error you see occurs because repr(soup) tries to mix Unicode and bytestrings. Mixing Unicode and bytestrings frequently leads to errors.

Compare:

>>> u'1' + '©'
Traceback (most recent call last):
  File "<input>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)

And:

>>> u'1' + u'©'
u'1\xa9'
>>> '1' + u'©'
u'1\xa9'
>>> '1' + '©'
'1\xc2\xa9'

Here's an example with classes:

>>> class A:
...     def __repr__(self):
...         return u'copyright ©'.encode('utf-8')
... 
>>> A()
copyright ©
>>> class B:
...     def __repr__(self):
...         return u'copyright ©'
... 
>>> B()
Traceback (most recent call last):
  File "<input>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa9' in position 10: ordinal not in range(128)
>>> class C:
...     def __repr__(self):
...         return repr(A()) + repr(B())
...
>>> C()
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "<input>", line 3, in __repr__
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa9' in position 10: ordinal not in range(128)

A similar thing happens with BeautifulSoup:

>>> html = """<p>©"""
>>> soup = BeautifulSoup(html)
>>> repr(soup)
Traceback (most recent call last):
  File "<input>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa9' in position 3: ordinal not in range(128)

To work around it:

>>> unicode(soup)
u'<p>\xa9</p>'
>>> str(soup)
'<p>\xc2\xa9</p>'
>>> soup.encode('utf-8')
'<p>\xc2\xa9</p>'

Answer 3

First of all, "troubling" Unicode characters could be letters in some language. But assuming you don't have to worry about non-English characters, you can use a Python library to convert Unicode to ASCII. Check out the answer to this question: How do I convert a file's format from Unicode to ASCII using Python?

The accepted answer there seems like a good solution (one I didn't know about beforehand).
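For illustration, one common standard-library way to do such a conversion is to decompose characters with `unicodedata.normalize` and then drop whatever is still non-ASCII. This is a sketch under that assumption and not necessarily what the linked answer does; `to_ascii` is a hypothetical helper name.

```python
import unicodedata

def to_ascii(text):
    """Hypothetical helper: best-effort ASCII version of a Unicode string."""
    # NFKD splits accented letters into a base letter plus combining marks,
    # e.g. u'\xe9' becomes u'e' followed by U+0301.
    decomposed = unicodedata.normalize('NFKD', text)
    # Dropping non-ASCII code points keeps the base letters and discards the
    # combining marks, along with symbols that have no ASCII equivalent.
    return decomposed.encode('ascii', 'ignore').decode('ascii')

print(repr(to_ascii(u'caf\xe9 \xa9 2011')))
```

Accented Latin letters degrade to their base letter, while characters such as u'\xa9' simply disappear, so this is lossy and only suitable when non-English text really does not matter.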

Answer 4

I had the same problem and spent hours on it. Notice that the error occurs whenever the interpreter has to display content: the interpreter tries to convert the text to ASCII, and that conversion fails. Take a look at the top answer here:

UnicodeEncodeError with BeautifulSoup 3.1.0.1 and Python 2.5.2
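As a rough sketch of the point about display: if the console can only show ASCII, encoding explicitly with an error handler before printing sidesteps the implicit, strict ASCII conversion. The sample string is invented, and this is not taken from the linked answer; `backslashreplace` and `xmlcharrefreplace` are standard Python codec error handlers.

```python
# Hypothetical sample: a Unicode string containing a non-ASCII character.
text = u'<p>\xae</p>'

# Replace unencodable characters with a backslash escape instead of failing.
escaped = text.encode('ascii', 'backslashreplace').decode('ascii')

# For HTML, a numeric character reference keeps the character renderable.
as_entity = text.encode('ascii', 'xmlcharrefreplace').decode('ascii')

print(escaped)
print(as_entity)
```

Both results are pure ASCII, so they can be shown on any console; the second form still renders as the original character in a browser.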

Answer 1: 10, accepted, 2011-03-08 18:46:28
Answer 2: 2, 2011-03-09 12:39:19
Answer 3: 1, 2011-03-08 18:13:36
Answer 4: 0, 2012-01-02 22:21:52
