Reading and writing non-English characters from websites with Python
I'm doing a bit of data scraping on Wikipedia, and I want to read certain entries. I'm using
urllib.urlopen('http://www.example.com')
and the read() method of the object it returns.
This works fine until it encounters non-English characters, as in Stanislav Šesták. Here are the first few lines:
import urllib
print urllib.urlopen("http://en.wikipedia.org/wiki/Stanislav_Šesták").read()
Result:
<!DOCTYPE html>
<html lang="en" dir="ltr" class="client-nojs">
<head>
<meta charset="UTF-8" /><title>Stanislav ֵ estֳ¡k - Wikipedia, the free encyclopedia</title>
<meta name="generator" content="MediaWiki 1.23wmf8" />
<link rel="alternate" type="application/x-wiki" title="Edit this page" href="/w/index.php?title=Stanislav_%C5%A0est%C3%A1k&action=edit" />
<link rel="edit" title="Edit this page" href="/w/index.php?title=Stanislav_%C5%A0est%C3%A1k&action=edit" />
<link rel="apple-touch-icon" href="//bits.wikimedia.org/apple-touch/wikipedia.png" />
How can I retain the non-English characters? In the end, this code will write the entry title and the URL to a .txt file.
There are multiple issues. Among them: non-ASCII characters in a URL have to be encoded to bytes using UTF-8 and then percent-encoded, e.g.
u"Stanislav_Šesták"
-> "Stanislav_%C5%A0est%C3%A1k"
Here's a code example that takes into account the above remarks:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import cgi
import urllib
import urllib2
wiki_title = u"Stanislav_Šesták"
url_path = urllib.quote(wiki_title.encode('utf-8'))
r = urllib2.urlopen("https://en.wikipedia.org/wiki/" + url_path)
_, params = cgi.parse_header(r.headers.get('Content-Type', ''))
encoding = params.get('charset')
content = r.read()
unicode_text = content.decode(encoding or 'utf-8')
print unicode_text # if it fails; set PYTHONIOENCODING
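The example above targets Python 2. For readers on Python 3, where urllib2 no longer exists, the same steps could be sketched roughly as follows. This is only a sketch under stated assumptions: the helper names build_wiki_url, decode_body, fetch_text and save_entry are made up for illustration, the UTF-8 fallback mirrors the `encoding or 'utf-8'` choice above, and the tab-separated file layout in save_entry is just one possible way to store the title and URL the question asks about.

```python
# Python 3 sketch; helper names are hypothetical, not from the original answer.
from email.message import Message
from urllib.parse import quote
from urllib.request import urlopen

WIKI_BASE = "https://en.wikipedia.org/wiki/"


def build_wiki_url(title):
    # quote() encodes the str as UTF-8, then percent-encodes the bytes.
    return WIKI_BASE + quote(title)


def decode_body(raw_bytes, content_type, default="utf-8"):
    # Pull the charset parameter out of a Content-Type header value.
    msg = Message()
    msg["Content-Type"] = content_type
    # Assumption: fall back to UTF-8 when the server names no charset.
    encoding = msg.get_content_charset() or default
    return raw_bytes.decode(encoding)


def fetch_text(url):
    # Download the page and return it as a str rather than raw bytes.
    with urlopen(url) as response:
        content_type = response.headers.get("Content-Type", "")
        return decode_body(response.read(), content_type)


def save_entry(path, title, url):
    # Append "title<TAB>url" to a text file, stating the encoding explicitly
    # so the result does not depend on the platform default.
    with open(path, "a", encoding="utf-8") as f:
        f.write(title + "\t" + url + "\n")
```

A call such as fetch_text(build_wiki_url("Stanislav_Šesták")) would then return decoded text, and save_entry("entries.txt", "Stanislav Šesták", build_wiki_url("Stanislav_Šesták")) would record the entry. Keeping the decoding step separate from the download makes it possible to check the charset handling without network access.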