Reading and writing non-English characters from a website using Python

Question

I'm doing a bit of data scraping on Wikipedia, and I want to read certain entries. I'm using urllib.urlopen('http://www.example.com') and calling .read() on the result.

This works fine until it encounters non-English characters, as in Stanislav Šesták. Here are the first few lines:

import urllib

print urllib.urlopen("http://en.wikipedia.org/wiki/Stanislav_Šesták").read()

Result:

<!DOCTYPE html>
<html lang="en" dir="ltr" class="client-nojs">
<head>
<meta charset="UTF-8" /><title>Stanislav ֵ estֳ¡k - Wikipedia, the free encyclopedia</title>
<meta name="generator" content="MediaWiki 1.23wmf8" />
<link rel="alternate" type="application/x-wiki" title="Edit this page" href="/w/index.php?title=Stanislav_%C5%A0est%C3%A1k&amp;action=edit" />
<link rel="edit" title="Edit this page" href="/w/index.php?title=Stanislav_%C5%A0est%C3%A1k&amp;action=edit" />
<link rel="apple-touch-icon" href="//bits.wikimedia.org/apple-touch/wikipedia.png" />

How can I retain the non-English characters? In the end, this code will write the entry title and the URL to a .txt file.

Answer 1

There are multiple issues:

non-ASCII characters in a string literal: in this case you must put an encoding declaration at the top of the module
you should urlencode the URL path ( u"Stanislav_Šesták" -> "Stanislav_%C5%A0est%C3%A1k" )
you are printing bytes received from the web to your terminal. Unless both use the same character encoding, you might see garbage instead of some characters
to interpret HTML, you should probably use an HTML parser
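
The second and third points can be checked without touching the network. Below is a minimal sketch in Python 3 (the question uses Python 2, where quote lives in urllib rather than urllib.parse). The variable names are illustrative, and latin-1 is only an assumed stand-in for "some encoding other than UTF-8"; the asker's actual terminal encoding is not stated.

```python
from urllib.parse import quote, unquote

title = "Stanislav_Šesták"

# Percent-encode the path segment; quote() encodes to UTF-8 by default.
url_path = quote(title)
print(url_path)  # Stanislav_%C5%A0est%C3%A1k

# The encoding is reversible.
assert unquote(url_path) == title

# The same UTF-8 bytes decoded with the wrong codec no longer match the
# original text, which is what a terminal with a different encoding shows.
raw = title.encode("utf-8")
wrong = raw.decode("latin-1")  # assumed mismatching codec, for illustration
right = raw.decode("utf-8")

print(ascii(wrong))    # escaped form, printable on any terminal
print(right == title)  # True
```

The percent-encoded result matches the href values in the page source shown in the question.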

Here's a code example that takes the above remarks into account:

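#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Python 2 (print statement, urllib2)
import cgi
import urllib
import urllib2

wiki_title = u"Stanislav_Šesták"
# percent-encode the UTF-8 bytes of the title for use in the URL path
url_path = urllib.quote(wiki_title.encode('utf-8'))
r = urllib2.urlopen("https://en.wikipedia.org/wiki/" + url_path)
# read the charset from the Content-Type response header, if present
_, params = cgi.parse_header(r.headers.get('Content-Type', ''))
encoding = params.get('charset')
content = r.read()
# decode the bytes to Unicode, falling back to UTF-8
unicode_text = content.decode(encoding or 'utf-8')
print unicode_text # if printing fails, set PYTHONIOENCODING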

Answer 1: accepted, 2014-01-06 20:46:20