Python Unencode Unicode HTML Hex

Question

Say I have strings that contain a lot of sequences like this:

&#x00e2;&#x0080;&#x009c;words words words

Is there a way in Python to convert these directly into the characters they represent?

I tried

import HTMLParser
h = HTMLParser.HTMLParser()
print h.unescape(x)

but I got this error:

UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)

I also tried

print h.unescape(x).encode('utf-8')

but that renders the sequence as â when it should be a quotation mark.

Answer 1

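The three references &#x00e2;&#x0080;&#x009c; form the UTF-8 byte sequence for the character U+201C LEFT DOUBLE QUOTATION MARK. Something is badly mangled in whatever produced that string: each UTF-8 byte was escaped as if it were a separate character. The correct encoding would be a single reference, &#x201C; (decimal &#8220;).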
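To make the byte-sequence claim concrete, here is a minimal sketch, assuming Python 3 (the rest of this answer uses Python 2). It rebuilds the mangled string by escaping each UTF-8 byte of U+201C separately, which is presumably what the upstream system did. The variable names are illustrative only.

```python
# Assumption: Python 3. Names below are illustrative, not from the question.
quote = "\u201c"  # LEFT DOUBLE QUOTATION MARK

# UTF-8 encodes U+201C as three bytes.
utf8_bytes = quote.encode("utf-8")  # b'\xe2\x80\x9c'

# Escaping every byte as its own hex character reference reproduces
# the broken string from the question.
mangled = "".join("&#x{:04x};".format(b) for b in utf8_bytes)

# Escaping the code point itself yields the single correct reference.
correct = "&#x{:04x};".format(ord(quote))

print(mangled)  # &#x00e2;&#x0080;&#x009c;
print(correct)  # &#x201c;
```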

You can use the HTML parser to unescape this, but you then need to repair the resulting Mojibake:

>>> import HTMLParser
>>> h = HTMLParser.HTMLParser()
>>> x = '&#x00e2;&#x0080;&#x009c;'
>>> h.unescape(x)
u'\xe2\x80\x9c'
>>> h.unescape(x).encode('latin1')
'\xe2\x80\x9c'
>>> h.unescape(x).encode('latin1').decode('utf8')
u'\u201c'
>>> print h.unescape(x).encode('latin1').decode('utf8')
“

If printing still gives you a UnicodeEncodeError, your terminal or console is not configured correctly and Python is unexpectedly encoding to ASCII.
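One way to check what Python thinks the console encoding is, and to print without crashing while you sort out the configuration, is sketched below. It assumes Python 3; `to_printable` is a hypothetical helper name, not part of any library.

```python
import sys

def to_printable(text, encoding=None):
    """Return text with characters the target encoding cannot represent
    replaced by backslash escapes. `to_printable` is a hypothetical helper."""
    # sys.stdout.encoding can be None when output is redirected in some setups.
    encoding = encoding or sys.stdout.encoding or "ascii"
    return text.encode(encoding, errors="backslashreplace").decode(encoding)

print(sys.stdout.encoding)                   # what Python will encode output to
print(to_printable("\u201cwords", "ascii"))  # \u201cwords
```

If the reported encoding is ASCII on a terminal that actually supports UTF-8, setting the `PYTHONIOENCODING` environment variable or fixing the locale is usually the real fix; the helper only keeps output readable in the meantime.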

Answer 2

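The problem is that the unescaped unicode string was never decoded properly: its code points are really UTF-8 byte values, so you need to turn it back into a byte string and decode that as UTF-8.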

import HTMLParser

x = "&#x00e2;&#x0080;&#x009c;words words words"
h = HTMLParser.HTMLParser()
msg = h.unescape(x)  # unicode string whose code points are really UTF-8 byte values
downcast = "".join(chr(ord(c) & 0xff) for c in msg)  # rebuild a plain byte string (Python 2 str)
print downcast.decode("utf8")

There may be a better way to do this within the HTMLParser library...
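For readers on Python 3, the same repair can be sketched with `html.unescape`. One difference to be aware of: as far as I know, Python 3's `html.unescape` follows HTML5 and maps numeric references in the 0x80-0x9F range to Windows-1252 characters (so `&#x0080;` becomes `€`), which means encoding with latin-1 alone fails. The sketch below tries cp1252 per character and falls back to latin-1. Both function names are hypothetical.

```python
import html

def _char_to_byte(ch):
    """Map one unescaped character back to the single byte it stood for."""
    try:
        return ch.encode("cp1252")   # undoes the HTML5 Windows-1252 remapping
    except UnicodeEncodeError:
        return ch.encode("latin-1")  # bytes that cp1252 leaves undefined

def repair_entity_mojibake(text):
    """Unescape HTML references, then reinterpret the result as UTF-8 bytes.
    Falls back to the plain unescaped text if that reinterpretation fails."""
    unescaped = html.unescape(text)
    try:
        raw = b"".join(_char_to_byte(ch) for ch in unescaped)
        return raw.decode("utf-8")
    except UnicodeError:
        # Not byte-per-reference mojibake; keep the ordinary unescaped text.
        return unescaped

print(repair_entity_mojibake("&#x00e2;&#x0080;&#x009c;words words words"))
```

The fallback matters because a correctly escaped string such as `caf&#xe9;` would otherwise raise when its bytes are decoded as UTF-8.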

2 answers

Answer 1: score 1, accepted, 2014-06-24 20:30:06

Answer 2: score 0, 2014-06-24 20:28:56