Decoding a Traditional Chinese string in an unknown encoding with Python

Question

Hi, I have a website in Traditional Chinese, and when I check the site statistics it tells me that the search term for the website is å%8f°å%8d%97 è¦ªå%90é¤%90å»³, which obviously makes no sense to me. My question is: what is this encoding called? And is there a way to use Python to decode this character string? Thank you.

Answer 1

It is called a mutt encoding; the underlying bytes have been mangled beyond their original meaning and no longer form a real encoding.

It was once URL-quoted UTF-8, but it has since been interpreted as Latin-1 without unquoting the URL escapes first. I was able to un-mangle it by treating it exactly that way (encode as Latin-1, unquote, then decode as UTF-8):

>>> from urllib2 import unquote
>>> bytesquoted = u'å%8f°å%8d%97 è¦ªå%90é¤%90å»³'.encode('latin1')
>>> unquoted = unquote(bytesquoted)
>>> print unquoted.decode('utf8')
台南 親子餐廳
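
The session above targets Python 2. As a minimal sketch of the same three steps in Python 3, assuming the input really is UTF-8 that was shown as Latin-1 with its URL escapes left in place: the helper name `unmangle` is made up for illustration, and `urllib.parse.unquote_to_bytes` is used so the result stays as bytes until the final decode. Note that 子 encodes to the bytes E5 AD 90, and byte AD is a soft hyphen in Latin-1, which is invisible in most displays, so it is written here as the escape `\xad`.

```python
from urllib.parse import unquote_to_bytes


def unmangle(text: str) -> str:
    """Undo 'UTF-8 bytes displayed as Latin-1 with %XX escapes left in place'."""
    # Step 1: map every character back to the single byte it stood for.
    raw = text.encode("latin-1")
    # Step 2: turn the remaining %XX escapes into raw bytes as well.
    raw = unquote_to_bytes(raw)
    # Step 3: the bytes are now the original UTF-8 sequence.
    return raw.decode("utf-8")


# \xad is the soft hyphen that is easy to lose when copying the string around.
mangled = "å%8f°å%8d%97 è¦ªå\xad%90é¤%90å»³"
print(unmangle(mangled))  # 台南 親子餐廳
```

If the soft hyphen has been dropped from the copied string, the final decode raises UnicodeDecodeError, because the bytes for 子 are then incomplete.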

Answer 2

You can use chardet. Install the library with:

pip install chardet
# or for python3
pip3 install chardet

The library includes a CLI utility, chardetect (or chardetect3 accordingly), that takes the path to a file.
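
chardet can also be called from Python through `chardet.detect`, which takes bytes and returns a dictionary with `encoding` and `confidence` keys. The following is only a sketch: the wrapper name `guess_encoding` is hypothetical, and the import is guarded so the snippet still loads when chardet is not installed.

```python
try:
    import chardet  # third-party: pip install chardet
except ImportError:
    chardet = None


def guess_encoding(data: bytes):
    """Return (encoding, confidence) as guessed by chardet, or None if chardet is missing."""
    if chardet is None:
        return None
    result = chardet.detect(data)
    # 'encoding' may itself be None when chardet cannot make a guess.
    return result["encoding"], result["confidence"]
```

A typical call would read the file in binary mode first, for example `guess_encoding(open('myfile.txt', 'rb').read())`. The result is a statistical guess, so a low confidence value is a hint to check the output by eye.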

Once you know the encoding, you can use it in Python, for example like this:

import codecs
f = codecs.open('myfile.txt', 'r', 'GB2312')

or from the shell:

iconv -f GB2312 -t UTF-8 myfile.txt -o decoded.txt
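
The same conversion can be sketched in Python 3 with the built-in `open`, which accepts an `encoding` argument directly. The function name `transcode_to_utf8` is an assumption made for this example, and the whole file is read into memory, so it suits small files only.

```python
def transcode_to_utf8(src_path, dst_path, src_encoding):
    """Read src_path using src_encoding and write the same text to dst_path as UTF-8."""
    # newline="" keeps the original line endings untouched in both directions.
    with open(src_path, "r", encoding=src_encoding, newline="") as src:
        text = src.read()
    with open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.write(text)
```

Calling `transcode_to_utf8('myfile.txt', 'decoded.txt', 'gb2312')` would mirror the iconv command above. A wrong source encoding usually surfaces as a UnicodeDecodeError, or as readable-looking but incorrect text.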

If you need more performance, there is also cchardet, a faster C-optimized version of chardet.
