
Decoding unknown encoded Traditional Chinese character strings using Python

Hi, I have a website in Traditional Chinese, and when I check the site statistics they tell me that the search term for the website is å%8f°å%8d%97 親å­%90é¤%90廳, which obviously makes no sense to me. My question is: what is this encoding called? And is there a way to use Python to decode this character string? Thank you.

It is called a mutt encoding; the underlying bytes have been mangled beyond their original meaning, and the result is no longer a real encoding.

It was once URL-quoted UTF-8, but it is now being interpreted as Latin-1 without those URL escapes having been unquoted. I was able to un-mangle it by reversing those steps: encode the text back to bytes as Latin-1, unquote the URL escapes, then decode the result as UTF-8:

>>> from urllib2 import unquote
>>> bytesquoted = u'å%8f°å%8d%97 親å­%90é¤%90廳'.encode('latin1')
>>> unquoted = unquote(bytesquoted)
>>> print unquoted.decode('utf8')
台南 親子餐廳

You can use chardet. Install the library with:

pip install chardet
# or for python3
pip3 install chardet

The library includes a CLI utility, chardetect (or chardetect3 accordingly), that takes the path to a file.

Once you know the encoding, you can use it in Python, for example like this:

import codecs
codecs.open('myfile.txt', 'r', 'GB2312')

or from the shell:

iconv -f GB2312 -t UTF-8 myfile.txt -o decoded.txt
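For reference, the same conversion can also be done from Python without iconv. This is a small sketch, assuming the source encoding is already known; convert_to_utf8 is a made-up helper name, not part of any library.

```python
def convert_to_utf8(src_path: str, dst_path: str, source_encoding: str) -> None:
    """Re-encode a text file from source_encoding to UTF-8 (hypothetical helper)."""
    # Decode while reading; bytes that are invalid in source_encoding
    # raise UnicodeDecodeError instead of being silently replaced.
    with open(src_path, "r", encoding=source_encoding, newline="") as src:
        text = src.read()
    # Encode as UTF-8 while writing; newline="" leaves line endings unchanged.
    with open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.write(text)
```

Calling convert_to_utf8('myfile.txt', 'decoded.txt', 'GB2312') would then mirror the iconv command above. The whole file is read into memory, so very large files would need a chunked variant.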

If you need more performance, there is also cchardet, a faster C-optimized version of chardet.
