
Python: Chinese text decodes to garbled special characters

I have a scrape/curl request that fetches HTML from another site. The page is in Chinese, but some of the text comes back garbled, like this:

°¢Àï°Í°ÍΪÄúÌṩÁË×ÔÁµÕß¹¤³§Ö±ÏúÆ·ÅƵç×Ó±í ÖÇÄÜʱÉг±Á÷ŮʿÊÖ»·ÊÖÁ´Ê×Êαí´øµÈ²úÆ·£¬ÕâÀïÔƼ¯ÁËÖÚ¶àµÄ¹©Ó¦ÉÌ£¬²É¹ºÉÌ£¬ÖÆÔìÉÌ¡£ÓûÁ˽â¸ü¶à×ÔÁµÕß¹¤³§Ö±ÏúÆ·ÅƵç×Ó±í ÖÇÄÜʱÉг±Á÷ŮʿÊÖ»·ÊÖÁ´Ê×Êαí´øÐÅÏ¢£¬Çë·ÃÎÊ°¢Àï°Í°ÍÅú·¢Íø£¡

That should be Chinese text. This is my code:

str(result.decode('ISO-8859-1'))

Without the 'ISO-8859-1' decode (returning the result variable as-is), it displays question marks like this:

����Ͱ�Ϊ���ṩ�������߹���ֱ��Ʒ�Ƶ��ӱ� ����ʱ�г���Ůʿ�ֻ��������α����Ȳ�Ʒ�������Ƽ����ڶ�Ĺ�Ӧ�̣��ɹ��̣������̡����˽���������߹���ֱ��Ʒ�Ƶ��ӱ� ����ʱ�г���Ůʿ�ֻ��������α�����Ϣ������ʰ���Ͱ���������

Could you help me figure out which encoding I should use to decode this?

Thanks

Chinese text can be stored in several possible charsets.
Three common Chinese charsets are gb2312, big5 and gbk.
Here is a snippet that converts a file from gb2312 to utf-8.

import io

# Read the whole file, decoding its bytes as gb2312.
with io.open("in.txt", "r", encoding="gb2312") as infile:
    text = infile.read()

print(text)

# Write the same text back out, encoded as utf-8.
with io.open("out.txt", "w", encoding="utf-8") as outfile:
    outfile.write(text)
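
If the text has already been decoded with the wrong codec, as in the question, the damage can often be undone without refetching the page. ISO-8859-1 maps every byte to exactly one character, so encoding the garbled string back to ISO-8859-1 recovers the original bytes, which can then be decoded with the right charset. A minimal Python 3 sketch, assuming the page really is GBK (the function name repair_mojibake is made up for illustration):

```python
def repair_mojibake(garbled, wrong="iso-8859-1", right="gbk"):
    """Undo a decode done with the wrong codec.

    Assumes `garbled` came from bytes.decode(wrong) and that the
    bytes were really encoded as `right`.
    """
    raw = garbled.encode(wrong)   # back to the original bytes
    return raw.decode(right)      # decode them with the correct charset


# Simulate the situation in the question: GBK bytes decoded as ISO-8859-1.
original = "阿里巴巴为您提供"
garbled = original.encode("gbk").decode("iso-8859-1")

print(repair_mojibake(garbled))
```

This only works when the wrong codec is lossless, as ISO-8859-1 is. Once bytes have been replaced by question marks or U+FFFD, the original data is gone and the page has to be fetched again.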

As @Thu Yein Tun mentioned, there is a very simple solution: look at the Content-Type header of the HTTP response for the URL. It showed text/html; charset=GBK, so the fix for my code was this:

result.decode('gbk')
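
The header check described above can be sketched in Python 3 with only the standard library. The helper names charset_from_content_type and decode_body are hypothetical, and the fallback to utf-8 when the header names no charset is an assumption, not something the server guarantees:

```python
from email.message import Message


def charset_from_content_type(header_value, default="utf-8"):
    """Return the charset named in a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns the charset lower-cased, or None.
    return msg.get_content_charset() or default


def decode_body(raw_bytes, content_type):
    """Decode a response body using the charset from its header."""
    charset = charset_from_content_type(content_type)
    return raw_bytes.decode(charset, errors="replace")


# Stand-in for the bytes returned by the request in the question.
raw = "阿里巴巴".encode("gbk")
print(decode_body(raw, "text/html; charset=GBK"))
```

Some pages declare the charset only in a <meta> tag inside the HTML, so the header alone may not always be enough.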

Try this block of code.

Import unquote, encode the string to bytes with latin1, then unquote those bytes and decode the result as UTF-8.

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from urllib2 import unquote

bytesquoted = u'å%8f°å%8d%97 è¦ªå\xad%90é¤%90å»³'.encode('latin1')
unquoted = unquote(bytesquoted)
print unquoted.decode('utf8')

Output :

台南 親子餐廳

Adding on to the answer provided by @Usman above. For Python 3.x you may do this:

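from urllib.parse import unquote_to_bytes

# The latin1 round trip turns the mis-decoded characters back into raw bytes.
bytesquoted = u'å%8f°å%8d%97 è¦ªå\xad%90é¤%90å»³'.encode('latin1')
# Resolve the %XX escapes to bytes, then decode everything as UTF-8.
print(unquote_to_bytes(bytesquoted).decode('utf8'))  # 台南 親子餐廳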

This should work for you.
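
For completeness, here is a small Python 3 sketch of the percent-encoding round trip on a clean string, with no latin1 detour. It uses only urllib.parse; the sample strings are illustrative, and the GBK variant assumes the URL was quoted from GBK bytes in the first place:

```python
from urllib.parse import quote, unquote

text = "台南 親子餐廳"

# quote() encodes to UTF-8 by default and escapes each byte as %XX.
quoted = quote(text)
print(quoted)

# unquote() reverses it, decoding the bytes as UTF-8 by default.
restored = unquote(quoted)
print(restored)

# Both functions accept an encoding argument for non-UTF-8 sites.
gbk_quoted = quote("阿里巴巴", encoding="gbk")
gbk_restored = unquote(gbk_quoted, encoding="gbk")
print(gbk_restored)
```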

Reference: Python: Importing urllib.quote
