Contradictory encoding specifications of Chinese characters when using Python to web scrape
I am using BeautifulSoup to scrape data from a Chinese online publishing website. This is the URL of one of the novels: http://www.jjwxc.net/onebook.php?novelid=1485737
I have tried different encoding and decoding schemes (e.g., gb2312, utf-8) and combinations of them to read the page. For example:
import requests
from bs4 import BeautifulSoup
url = "http://www.jjwxc.net/onebook.php?novelid=1485737"
response = requests.get(url)
text = response.text
print text.encode('gb2312')
>> UnicodeEncodeError: 'gb2312' codec can't encode character u'\xa1' in position 340: illegal multibyte sequence
print text.encode('utf-8')
>> <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=gb2312"/>
<meta http-equiv="X-UA-Compatible" content="IE=EmulateIE7" />
<title>¡¶£¨Õý°æ£©±¼Ô¡·Êñ¿Í_¡¾Ô´´Ð¡Ëµ|ÑÔÇéС˵¡¿_½ú½ÎÄѧ³Ç</title>
<meta name="Keywords" content="Êñ¿Í,£¨Õý°æ£©±¼ÔÂ,Êñ¿Í¡¶£¨Õý°æ£©±¼Ô¡·,Ö÷½Ç£ºÁøÉÒ ©§ Åä½Ç£ºÔ£¬Â½À룬ËÕÐÅ£¬°×ÒÂÚÄÇ£¬Âå¸è£¬×¿ÇïÏÒ£¬ÉÌÓñÈÝ£¬Ð»ÁîÆëµÈµÈ£¨³ö³¡ÅÅÃû£© ©§ ÆäËü£ºÏÉÏÀ£¬ÁøÉÒ£¬ÔÂÉñ£¬Éñ»°,ÇéÓжÀÖÓ Å°ÁµÇéÉî ÁéÒìÉñ¹Ö âêÈ»Èôʧ ×îиüÐÂ:2015-07-15 23:57:04 ×÷Æ·»ý·Ö£º193191456" />
Note that the document itself claims to be encoded in gb2312.
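A likely explanation for both symptoms, assuming the server's Content-Type header carries no charset: requests then falls back to ISO-8859-1 for text responses and does not consult the page's meta tag, so `response.text` holds GB2312 bytes that were decoded as Latin-1. The sketch below reproduces this offline in Python 3 (the snippets above are Python 2). The sample title is a hypothetical stand-in, not data fetched from the site.

```python
# Hypothetical sample standing in for the page title (not fetched from the site).
original = "《奔月》"

# What the server would send if the page really is GB2312-encoded.
raw_bytes = original.encode("gb2312")

# What a client produces when it decodes those bytes as ISO-8859-1 instead.
garbled = raw_bytes.decode("iso-8859-1")
print(ascii(garbled))  # every byte became one Latin-1 code point

# Encoding the garbled text as gb2312 fails, because Latin-1 characters
# such as U+00A1 have no GB2312 representation.
try:
    garbled.encode("gb2312")
    encode_failed = False
except UnicodeEncodeError as exc:
    encode_failed = True
    print("encode failed:", exc.reason)
```

Encoding the same garbled text as utf-8 succeeds, but it only serializes the wrong characters, which would match the accented-letter output shown above.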
I looked around the forum and realized that there may be a problem with the encoding declaration. If I try the following:
import urllib2
html = urllib2.urlopen('http://www.jjwxc.net/onebook.php?novelid=1485737').read()
soup = BeautifulSoup(html)
soup.original_encoding
>> {windows-1252}
But
import chardet
chardet.detect(html)
gives
>> {'confidence': 0.0, 'encoding': None}
Can someone shed some light on this problem? Thank you!
I used the method mentioned in "how to decode and encode web page with python?" and found that it worked with most Chinese websites, except the one I am interested in.
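Since BeautifulSoup guessed windows-1252 and chardet returned no guess at all, auto-detection looks unreliable for this page. One alternative is to trust the charset that the document declares and decode the raw bytes explicitly. The following is a minimal Python 3 sketch under that assumption. The helper names `declared_charset` and `decode_html` and the sample markup are illustrative, not part of any library. GB18030 is used for the whole GB family because it is generally treated as a superset of GBK and GB2312.

```python
import re

# Matches charset=... in a meta tag, with or without quotes.
CHARSET_RE = re.compile(rb"charset\s*=\s*[\"']?\s*([A-Za-z0-9_.:-]+)", re.IGNORECASE)
GB_FAMILY = {"gb2312", "gbk", "gb18030"}


def declared_charset(raw, default=None):
    """Return the charset named near the top of the raw HTML bytes, if any."""
    match = CHARSET_RE.search(raw[:4096])
    return match.group(1).decode("ascii").lower() if match else default


def decode_html(raw):
    """Decode raw HTML bytes using the declared charset (UTF-8 if none)."""
    charset = declared_charset(raw, default="utf-8")
    if charset in GB_FAMILY:
        charset = "gb18030"  # widest member of the family
    # Undecodable bytes become U+FFFD instead of raising.
    return raw.decode(charset, errors="replace")


# Hypothetical sample bytes imitating the page's head section.
sample = (
    '<meta http-equiv="Content-Type" content="text/html; charset=gb2312"/>'
    "<title>晋江文学城</title>"
).encode("gbk")

print(declared_charset(sample))
print("晋江文学城" in decode_html(sample))
```

With BeautifulSoup, a comparable effect should be available by passing the raw bytes together with the `from_encoding` argument, so that the parser does not have to guess.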
Try this; it should do the job.
Python's gbk codec converts to and from GBK, which is a superset of GB2312. GB18030 extends GBK further and has its own codec, gb18030.
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
import requests
from bs4 import BeautifulSoup
url = "http://www.jjwxc.net/onebook.php?novelid=1485737"
response = requests.get(url)
text = response.text
text = text.decode('gbk').encode('utf-8')
print text
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=gb2312"/>
<meta http-equiv="X-UA-Compatible" content="IE=EmulateIE7" />
<title>隆露拢篓脮媒掳忙拢漏卤录脭脗隆路脢帽驴脥_隆戮脭颅麓麓脨隆脣碌|脩脭脟茅脨隆脣碌隆驴_陆煤陆颅脦脛脩搂鲁脟</title>
...
...
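The title in the output above is still garbled, which suggests that `response.text` had already been mis-decoded before the gbk step ran. If the text was decoded as ISO-8859-1, the original bytes can be recovered losslessly and decoded again. The following is a minimal Python 3 sketch under that assumption. The function name `repair_latin1_mojibake` and the sample string are hypothetical.

```python
def repair_latin1_mojibake(garbled, encoding="gb18030"):
    """Undo a wrong ISO-8859-1 decode and decode the recovered bytes properly.

    Assumes `garbled` came from decoding GB-family bytes as ISO-8859-1,
    so every character maps back to exactly one original byte.
    """
    raw = garbled.encode("latin-1")
    return raw.decode(encoding, errors="replace")


# Hypothetical sample: simulate the wrong decode, then repair it.
sample = "<title>晋江文学城</title>"
garbled = sample.encode("gbk").decode("latin-1")

print(repair_latin1_mojibake(garbled) == sample)
```

With requests, the repair step should not be needed if the encoding is set before the text is read, for example `response.encoding = 'gb18030'` followed by `response.text`, or by decoding `response.content` (the raw bytes) directly.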