urllib2 read to Unicode

Question

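I need to store the content of a site that can be in any language, and I need to be able to search that content for a Unicode string.

I have tried something like: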

import urllib2

req = urllib2.urlopen('http://lenta.ru')
content = req.read()

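The content is a byte stream, so I cannot search it directly for a Unicode string.

I need some way, when I do urlopen and then read, to decode the content using the charset from the headers and encode it as UTF-8.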

Answer 1

After the operations you performed, you'll see:

>>> req.headers['content-type']
'text/html; charset=windows-1251'

So:

>>> encoding=req.headers['content-type'].split('charset=')[-1]
>>> ucontent = unicode(content, encoding)

ucontent is now a Unicode string. For example, if your terminal is UTF-8, you can display part of it:

>>> print ucontent[76:110].encode('utf-8')
<title>Lenta.ru: Главное: </title>

And you can search it, and so on.
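Splitting on 'charset=' returns the whole header value when no charset is present, which then fails as an encoding name. A minimal sketch of a more defensive parse follows, written for Python 3. The helper name charset_from_content_type and the 'utf-8' default are assumptions for illustration, not part of the answer, and the bytes and header are stand-ins so no network access is needed.

```python
def charset_from_content_type(content_type, default="utf-8"):
    """Return the charset parameter of a Content-Type value, or a default."""
    # Parameters follow the media type, separated by semicolons.
    for part in content_type.split(";")[1:]:
        name, _, value = part.partition("=")
        if name.strip().lower() == "charset" and value.strip():
            # Drop surrounding whitespace and optional quotes.
            return value.strip().strip("\"'")
    # Assumed fallback when the header names no charset.
    return default


# Stand-ins for req.read() and req.headers['content-type'] (no network needed).
raw = "<title>Lenta.ru: Главное: </title>".encode("windows-1251")
header = "text/html; charset=windows-1251"

text = raw.decode(charset_from_content_type(header))
print(text)
```

The fallback is only a guess; a page served without a charset may well use another encoding.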

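Edit: Unicode I/O is usually tricky (this may be what is holding up the original asker), but I'm going to bypass the difficult problem of entering Unicode strings into an interactive Python interpreter (completely unrelated to the original question) to show how, once a Unicode string is correctly entered (I'm doing it by codepoints: goofy but not tricky ;-), searching is a no-brainer (and so, hopefully, the original question is thoroughly answered). Again assuming a UTF-8 terminal: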

>>> x=u'\u0413\u043b\u0430\u0432\u043d\u043e\u0435'
>>> print x.encode('utf-8')
Главное
>>> x in ucontent
True
>>> ucontent.find(x)
93
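To illustrate why the decoding step matters, here is a small Python 3 sketch that compares searching the raw bytes with searching the decoded text, then re-encodes the text as UTF-8 for storage. The sample bytes are a stand-in for the downloaded page, not the real response, so the position differs from the one shown above.

```python
needle = "\u0413\u043b\u0430\u0432\u043d\u043e\u0435"  # same codepoints as x above

# Stand-in for the bytes returned by read(), encoded as windows-1251.
raw = "<title>Lenta.ru: Главное: </title>".encode("windows-1251")

# The UTF-8 form of the needle does not occur in windows-1251 bytes.
found_in_bytes = needle.encode("utf-8") in raw

# After decoding, the search works on characters instead of bytes.
text = raw.decode("windows-1251")
found_in_text = needle in text
position = text.find(needle)

# Re-encode as UTF-8 when the content has to be stored as bytes.
utf8_bytes = text.encode("utf-8")

print(found_in_bytes, found_in_text, position)
```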

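Note: keep in mind that this method may not work for all sites, since some sites specify the character encoding only inside the served document (using an http-equiv meta tag, for example).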
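For pages that declare the encoding only in the markup, one possible fallback is to look for a meta declaration in the first bytes of the document. The sketch below is an assumption-laden illustration in Python 3: the name sniff_meta_charset, the regular expression, and the 2048-byte scan limit are all hypothetical choices, and a regex is far looser than a real HTML parser.

```python
import re

# Matches <meta charset="..."> and the charset=... part of the http-equiv form.
META_CHARSET = re.compile(
    rb"""<meta[^>]*?charset\s*=\s*["']?([A-Za-z0-9_.:-]+)""",
    re.IGNORECASE,
)


def sniff_meta_charset(raw, scan_limit=2048):
    """Return a charset declared in a meta tag near the start of raw, or None."""
    match = META_CHARSET.search(raw[:scan_limit])
    return match.group(1).decode("ascii").lower() if match else None


old_style = b'<meta http-equiv="Content-Type" content="text/html; charset=windows-1251">'
new_style = b'<html><head><meta charset="utf-8"></head>'

print(sniff_meta_charset(old_style))
print(sniff_meta_charset(new_style))
print(sniff_meta_charset(b"<html><head></head>"))
```

This works on bytes because the declaration itself is ASCII in the common encodings, so it can be read before the document's encoding is known.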

Answer 2

To parse the Content-Type HTTP header, you can use the cgi.parse_header function:

import cgi
import urllib2

r = urllib2.urlopen('http://lenta.ru')
_, params = cgi.parse_header(r.headers.get('Content-Type', ''))
encoding = params.get('charset', 'utf-8')
unicode_text = r.read().decode(encoding)

Another way to get the charset:

>>> import urllib2
>>> r = urllib2.urlopen('http://lenta.ru')
>>> r.headers.getparam('charset')
'utf-8'

Or in Python 3:

>>> import urllib.request
>>> r = urllib.request.urlopen('http://lenta.ru')
>>> r.headers.get_content_charset()
'utf-8'
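In Python 3 the response headers are an http.client.HTTPMessage, which subclasses email.message.Message, so get_content_charset can be tried without a network connection. The following is a minimal sketch with a hand-built header; the header value and the 'utf-8' fallback are assumptions for illustration.

```python
from email.message import Message

# Build a header object by hand instead of fetching a page.
headers = Message()
headers["Content-Type"] = "text/html; charset=Windows-1251"

# The charset is returned lowercased.
charset = headers.get_content_charset()

# With no Content-Type header at all, the failobj value is returned.
missing = Message().get_content_charset(failobj="utf-8")

print(charset, missing)
```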

The character encoding can also be specified inside the HTML document, e.g. <meta charset="utf-8">.

Answer 1: 99, accepted, 2009-06-20 04:17:41
Answer 2: 10, 2013-12-21 02:23:33
