Encoding problem when scraping Chinese text into a CSV with Python

Question

I'm having trouble scraping Chinese text into a CSV. I've tried three different things (left as comments in the code), but the CSV still contains only garbled text.

from bs4 import BeautifulSoup
import urllib2
#import codecs

url="http://v.youku.com/v_show/id_XOTU2Nzc3NDYw.html"
page = urllib2.urlopen(url,context=gcontext).read()#.decode('utf-8', 'ignore')
soup = BeautifulSoup(page)
title= soup.findAll('h1', { "class" : "title" })[0].string#.encode('utf-8')
outputfile='.../file.csv'
fd = open(outputfile,'a')
#fd = codecs.open(outputfile, "a", "utf-8")    
fd.write(title)
fd.close()

Answer 1

This is because you are trying to decode/encode with utf-8; you should use a different encoding instead. Link to page: http://pythoncentral.io/encoding-and-decoding-strings-in-python-3-x/

Answer 2

The main page is encoded in UTF-8. I could load it this way:

>>> url="http://v.youku.com/v_show/id_XOTU2Nzc3NDYw.html"
>>> page = urllib2.urlopen(url)
>>> page.headers.get('content-type')
'text/html; charset=UTF-8'
>>> txt = page.read().decode('utf8')
>>> print txt

<!DOCTYPE html>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta http-equiv="X-UA-Compatible" content="IE=Edge">
...

So the page declares, both at the HTTP level and in an HTML meta tag, that it is UTF-8 encoded, and it decodes cleanly as UTF-8.
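
For readers who want to check the declared encoding programmatically, here is a minimal sketch in Python 3 (the session in this answer is Python 2). It parses a Content-Type value like the one shown above using the standard library's email.message.Message; the helper name declared_charset and its default argument are illustrative, not part of the original answer.

```python
from email.message import Message


def declared_charset(content_type, default='utf-8'):
    """Return the charset parameter of a Content-Type value, lowercased.

    Falls back to `default` when the header carries no charset parameter.
    """
    msg = Message()
    msg['Content-Type'] = content_type
    return msg.get_content_charset(default)


# Header value taken from the interactive session above.
charset = declared_charset('text/html; charset=UTF-8')
print(charset)
```

In Python 3, the object returned by urllib.request.urlopen exposes the same method as response.headers.get_content_charset(), so the header does not need to be parsed by hand.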

I went one step further:

>>> soup = BeautifulSoup(txt)
>>> title= soup.findAll('h1', { "class" : "title" })[0].string.encode('utf-8')
>>> print repr(title)
'\n\t\t\xe8\xa7\x86\xe9\xa2\x91: \xe3\x80\x90\xe9\xac\xbc\xe9\x97\x95\xe4\xb8\x83\xe7\x9a\x87\xe3\x80\x91\xe5\x90\x84\xe5\x9b\xbd\xe8\xb7\x91\xe9\x85\xb7\xe9\xab\x98\xe6\x89\x8b\xe6\x9e\x81\xe9\x99\x90\xe8\xb7\x91\xe9\x85\xb7\xe6\xb7\xb7\xe5\x89\xaa'

So title is a perfectly valid UTF-8 encoded byte string: when I printed it, it displayed the Chinese characters.
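
One way to confirm this without relying on the terminal is to decode the bytes strictly, which raises an error on any invalid sequence. The sketch below is Python 3 and reuses the byte string from the repr() output above; the variable names are illustrative.

```python
# Bytes copied from the repr() output above.
TITLE_BYTES = (
    b'\n\t\t\xe8\xa7\x86\xe9\xa2\x91: \xe3\x80\x90\xe9\xac\xbc\xe9\x97\x95'
    b'\xe4\xb8\x83\xe7\x9a\x87\xe3\x80\x91\xe5\x90\x84\xe5\x9b\xbd\xe8\xb7\x91'
    b'\xe9\x85\xb7\xe9\xab\x98\xe6\x89\x8b\xe6\x9e\x81\xe9\x99\x90\xe8\xb7\x91'
    b'\xe9\x85\xb7\xe6\xb7\xb7\xe5\x89\xaa'
)

# errors='strict' raises UnicodeDecodeError if the bytes are not valid UTF-8.
title_text = TITLE_BYTES.decode('utf-8', errors='strict')

# Encoding the text again should reproduce the original bytes exactly.
round_trip_ok = title_text.encode('utf-8') == TITLE_BYTES

print(title_text.strip())
print(round_trip_ok)
```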

If the file seems to contain garbage, it is simply because you opened it with an editor that does not support UTF-8, or forgot to put the editor in UTF-8 mode.
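
If the program reading the CSV cannot be switched to UTF-8 by hand, one common workaround is to write a byte order mark at the start of the file, which some spreadsheet applications use to detect UTF-8. The following is a sketch under assumptions, not the asker's code: it is Python 3, the function name write_title_csv and the temporary file titles.csv are hypothetical, and it writes a fresh file rather than appending.

```python
import csv
import os
import tempfile

# Bytes copied from the repr() output above.
TITLE_BYTES = (
    b'\n\t\t\xe8\xa7\x86\xe9\xa2\x91: \xe3\x80\x90\xe9\xac\xbc\xe9\x97\x95'
    b'\xe4\xb8\x83\xe7\x9a\x87\xe3\x80\x91\xe5\x90\x84\xe5\x9b\xbd\xe8\xb7\x91'
    b'\xe9\x85\xb7\xe9\xab\x98\xe6\x89\x8b\xe6\x9e\x81\xe9\x99\x90\xe8\xb7\x91'
    b'\xe9\x85\xb7\xe6\xb7\xb7\xe5\x89\xaa'
)


def write_title_csv(path, title):
    """Write one title as a single-column CSV row, UTF-8 with a BOM."""
    # 'utf-8-sig' prepends the UTF-8 byte order mark (EF BB BF);
    # newline='' lets the csv module control line endings itself.
    with open(path, 'w', encoding='utf-8-sig', newline='') as fd:
        csv.writer(fd).writerow([title.strip()])


# Hypothetical output location, used only for this demonstration.
csv_path = os.path.join(tempfile.mkdtemp(), 'titles.csv')
write_title_csv(csv_path, TITLE_BYTES.decode('utf-8'))

# Read the file back as raw bytes to inspect what was actually written.
with open(csv_path, 'rb') as fd:
    raw = fd.read()
print(raw[:3])
```

The text after the first three bytes is ordinary UTF-8, so a reader that already handles UTF-8 only needs to skip the mark, for example by opening the file with encoding='utf-8-sig'.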

2 answers

Answer 1
1 2015-12-11 14:57:55

Answer 2
1 accepted 2015-12-11 17:57:26
