Decoding unknown encoded Traditional Chinese character strings using Python
Hi, I have a website in Traditional Chinese, and when I check the site statistics it tells me that the search term for the website is
å%8f°å%8d%97 親å%90é¤%90廳
which obviously makes no sense to me.
My question is: what is this encoding called? And is there a way to use Python to decode this character string?
Thank you.
It is called a mutt encoding; the underlying bytes have been mangled beyond their original meaning and they are no longer a real encoding.
It was once URL-quoted UTF-8, but it has since been interpreted as Latin-1 without unquoting those URL escapes. I was able to un-mangle this by interpreting it as such:
>>> from urllib2 import unquote
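>>> # The literal must be pure Latin-1: è¦ª and å»³ are the mangled forms of 親 and 廳,
>>> # and \xad is the invisible soft hyphen that belongs to 子.
>>> bytesquoted = u'å%8f°å%8d%97 è¦ªå\xad%90é¤%90å»³'.encode('latin1')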
>>> unquoted = unquote(bytesquoted)
>>> print unquoted.decode('utf8')
台南 親子餐廳
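The session above is Python 2 (`urllib2`, the `print` statement). A minimal Python 3 sketch of the same three steps might look like the following. The helper name `unmangle` is hypothetical. The sample string is written in its fully mangled Latin-1 form, on the assumption that every character of the search term went through the same misinterpretation; `\xad` is the soft hyphen, which is invisible when the string is displayed.

```python
from urllib.parse import unquote_to_bytes


def unmangle(text):
    """Undo 'URL-quoted UTF-8 that was read as Latin-1' (hypothetical helper)."""
    # Step 1: turn each displayed Latin-1 character back into the single byte it came from.
    raw = text.encode('latin-1')
    # Step 2: resolve the remaining %XX escapes into the bytes they stand for.
    # Passing bytes (not str) keeps unquote_to_bytes from re-encoding anything as UTF-8.
    raw = unquote_to_bytes(raw)
    # Step 3: the recovered byte sequence is ordinary UTF-8.
    return raw.decode('utf-8')


# Assumed fully mangled form of the search term; \xad is the soft hyphen.
mangled = 'å%8f°å%8d%97 è¦ªå\xad%90é¤%90å»³'
print(unmangle(mangled))
```

If the input contains a character outside Latin-1, `encode('latin-1')` raises `UnicodeEncodeError`, which is a useful signal that the string was not mangled in this particular way.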
You can use chardet. Install the library with:
pip install chardet
# or for python3
pip3 install chardet
The library includes a CLI utility, chardetect (or chardetect3 accordingly), that takes the path to a file.
Once you know the encoding, you can use it in Python, for example like this:
import codecs
codecs.open('myfile.txt', 'r', 'GB2312')
or from the shell:
iconv -f GB2312 -t UTF-8 myfile.txt -o decoded.txt
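For reference, the same conversion can be done from Python without iconv. This is only a sketch under the assumption that the source encoding is already known (GB2312 here, as in the example above); the function name `convert_to_utf8` and the file names are hypothetical. It uses the built-in `open` of Python 3, which accepts an `encoding` argument directly.

```python
def convert_to_utf8(src_path, dst_path, src_encoding):
    """Re-encode a text file from a known encoding to UTF-8 (hypothetical helper)."""
    # Decode the source file with the encoding that was detected or is otherwise known.
    with open(src_path, 'r', encoding=src_encoding) as src:
        text = src.read()
    # Write the decoded text back out as UTF-8.
    with open(dst_path, 'w', encoding='utf-8') as dst:
        dst.write(text)


# Example call, mirroring the iconv command above (file names are placeholders):
# convert_to_utf8('myfile.txt', 'decoded.txt', 'GB2312')
```

Reading the whole file at once keeps the sketch short; for very large files, copying in chunks would be the more careful choice.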
If you need more performance, there is also cchardet, a faster C-optimized version of chardet.