Python UnicodeEncodeError: how can I simply remove troubling Unicode characters?

Question

Here's what I did:

>>> soup = BeautifulSoup (html)
>>> soup
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xae' in position 96953: ordinal not in range(128)
>>> 
>>> soup.find('div')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xae' in position 11035: ordinal not in range(128)
>>> 
>>> soup.find('span')
<span id="navLogoPrimary" class="navSprite"><span>amazon.com</span></span>
>>>

How can I simply remove the troubling Unicode characters from the HTML?
Or is there a cleaner solution?

Answer 1

Try this: soup = BeautifulSoup(html.decode('utf-8', 'ignore'))
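To make the effect of the 'ignore' error handler concrete, here is a minimal sketch. It is written for Python 3 (the thread itself uses Python 2), and the byte string `raw` is a made-up stand-in for the downloaded page. Note that 'ignore' only drops bytes that are not valid UTF-8; correctly encoded characters such as ® survive the decode.

```python
# Hypothetical page bytes: valid UTF-8 for the registered sign (\xc2\xae)
# followed later by a stray byte (\xff) that is not valid UTF-8.
raw = b"<p>Kindle\xc2\xae \xffstore</p>"

# Strict decoding (the default) fails on the invalid byte.
try:
    raw.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# errors="ignore" silently drops undecodable bytes and keeps everything else.
text = raw.decode("utf-8", "ignore")
print(text)
```

If the goal is to get rid of valid non-ASCII characters as well, decoding alone is not enough; the text would also have to be encoded with a lossy error handler afterwards.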

Answer 2

The error you see is caused by repr(soup) trying to mix Unicode and bytestrings. Mixing Unicode and bytestrings frequently leads to errors.

Compare:

>>> u'1' + '©'
Traceback (most recent call last):
  File "<input>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)

and:

>>> u'1' + u'©'
u'1\xa9'
>>> '1' + u'©'
u'1\xa9'
>>> '1' + '©'
'1\xc2\xa9'

Here is an example with classes:

>>> class A:
...     def __repr__(self):
...         return u'copyright ©'.encode('utf-8')
... 
>>> A()
copyright ©
>>> class B:
...     def __repr__(self):
...         return u'copyright ©'
... 
>>> B()
Traceback (most recent call last):
  File "<input>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa9' in position 10: ordinal not in range(128)
>>> class C:
...     def __repr__(self):
...         return repr(A()) + repr(B())
...
>>> C()
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "<input>", line 3, in __repr__
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa9' in position 10: ordinal not in range(128)

Something similar happens with BeautifulSoup:

>>> html = """<p>©"""
>>> soup = BeautifulSoup(html)
>>> repr(soup)
Traceback (most recent call last):
  File "<input>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa9' in position 3: ordinal not in range(128)

To fix it:

>>> unicode(soup)
u'<p>\xa9</p>'
>>> str(soup)
'<p>\xc2\xa9</p>'
>>> soup.encode('utf-8')
'<p>\xc2\xa9</p>'
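The same idea, choosing the output encoding explicitly instead of leaving it to an implicit ASCII conversion, can be sketched in Python 3 without BeautifulSoup. This is only an illustration: `markup` is a hypothetical stand-in for the text that unicode(soup) returns above, and the two lossy variants are options added here, not part of the original answer.

```python
# Stand-in for the Unicode text of the parsed document (assumption for illustration).
markup = "<p>\u00a9</p>"

# Explicit UTF-8 encoding keeps the character as two bytes.
utf8_bytes = markup.encode("utf-8")

# Dropping every non-ASCII character, if they really are unwanted.
ascii_only = markup.encode("ascii", "ignore").decode("ascii")

# Keeping the information in an ASCII-safe form via numeric character references.
escaped = markup.encode("ascii", "xmlcharrefreplace").decode("ascii")

print(utf8_bytes, ascii_only, escaped)
```

For HTML, the character-reference variant is usually the gentler choice, since a browser still renders the original symbol.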

Answer 3

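First of all, the "troubling" Unicode characters could be letters in some language. But assuming you don't need to worry about non-English characters, you can use a Python library to convert Unicode to ASCII. Check out the answers to this question: How do I convert a file's format from Unicode to ASCII using Python?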

The accepted answer there seems to be a good solution (one I didn't know about beforehand).
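One common standard-library way to do such a conversion, not necessarily the one from the linked answer, is to decompose characters with unicodedata and then drop whatever still is not ASCII. The sketch below is written for Python 3, and the helper name `to_ascii` is made up for illustration.

```python
import unicodedata


def to_ascii(text):
    """Return an ASCII-only approximation of text (hypothetical helper)."""
    # NFKD splits accented letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Combining marks and symbols without an ASCII form are then dropped.
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(to_ascii("caf\u00e9"))     # accented letter keeps its base letter
print(to_ascii("amazon\u00ae"))  # the registered sign is removed entirely
```

This is lossy by design: letters from non-Latin scripts disappear completely, which is exactly the caveat raised at the start of this answer.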

Answer 4

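I had the same problem and spent hours on it. Note that the error occurs whenever the interpreter has to display the content: the interpreter tries to convert it to ASCII, and that conversion is what fails. Take a look at the top answer here: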

UnicodeEncodeError with BeautifulSoup 3.1.0.1 and Python 2.5.2
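The point that the failure happens at display time can be reproduced with a small Python 3 sketch that simulates an ASCII-only output stream. The helper `write_to_stream` is hypothetical and exists only for this illustration; it wraps an in-memory buffer the way the interpreter wraps the terminal.

```python
import io


def write_to_stream(text, encoding, errors="strict"):
    """Write text to an in-memory stream with the given encoding (hypothetical helper)."""
    buffer = io.BytesIO()
    stream = io.TextIOWrapper(buffer, encoding=encoding, errors=errors)
    stream.write(text)
    stream.flush()
    return buffer.getvalue()


content = "amazon\u00ae"  # the text itself is fine; only displaying it can fail

try:
    write_to_stream(content, "ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# A lossy error handler or a UTF-8 stream both avoid the exception.
replaced = write_to_stream(content, "ascii", errors="replace")
utf8 = write_to_stream(content, "utf-8")
print(ascii_failed, replaced, utf8)
```

So the data does not need to be cleaned at all if the output stream can represent it; the characters only have to be removed or replaced when the destination really is limited to ASCII.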
