
Reading and writing non-English characters from websites with Python

I'm doing a bit of data scraping on Wikipedia, and I want to read certain entries. I'm using urllib.urlopen('http://www.example.com') and calling read() on the result.

This works fine until it encounters non-English characters, as in Stanislav Šesták. Here are the first few lines:

import urllib

print urllib.urlopen("http://en.wikipedia.org/wiki/Stanislav_Šesták").read()

Result:

<!DOCTYPE html>
<html lang="en" dir="ltr" class="client-nojs">
<head>
<meta charset="UTF-8" /><title>Stanislav ֵ estֳ¡k - Wikipedia, the free encyclopedia</title>
<meta name="generator" content="MediaWiki 1.23wmf8" />
<link rel="alternate" type="application/x-wiki" title="Edit this page" href="/w/index.php?title=Stanislav_%C5%A0est%C3%A1k&amp;action=edit" />
<link rel="edit" title="Edit this page" href="/w/index.php?title=Stanislav_%C5%A0est%C3%A1k&amp;action=edit" />
<link rel="apple-touch-icon" href="//bits.wikimedia.org/apple-touch/wikipedia.png" />

How can I retain the non-English characters? In the end, this code will write the entry title and the URL to a .txt file.

There are multiple issues:

  • non-ASCII characters in a string literal: in this case you must put an encoding declaration at the top of the module
  • you should urlencode the URL path ( u"Stanislav_Šesták" -> "Stanislav_%C5%A0est%C3%A1k" )
  • you are printing bytes received from the web to your terminal. Unless both use the same character encoding, you might see garbage instead of some characters
  • to interpret HTML, you should probably use an HTML parser
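The second and third points can be illustrated without any network access. The following is a minimal sketch in Python 3, not the Python 2 used elsewhere on this page; in Python 3 source files are UTF-8 by default, so no encoding declaration is needed. The choice of latin-1 as the "wrong" codec is only an assumption for illustration, not a claim about the asker's terminal.

```python
from urllib.parse import quote, unquote

title = "Stanislav_Šesták"

# Percent-encode the UTF-8 bytes of the title for use in a URL path.
encoded = quote(title)
print(encoded)  # Stanislav_%C5%A0est%C3%A1k

# Bytes as they would arrive from the server (UTF-8 in this example).
raw = title.encode("utf-8")

# Decoding with a codec other than the one that produced the bytes
# silently yields different, garbled characters.
garbled = raw.decode("latin-1")
print(garbled == title)  # False

# Decoding with the matching codec restores the original text.
restored = raw.decode("utf-8")
print(restored == title)  # True
```

Printing undecoded bytes to a terminal has the same effect as the latin-1 step: the terminal interprets them in its own encoding, whatever that happens to be.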

Here's a code example that takes the above remarks into account:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import cgi
import urllib
import urllib2

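wiki_title = u"Stanislav_Šesták"
# Percent-encode the UTF-8 bytes of the title for use in the URL path.
url_path = urllib.quote(wiki_title.encode('utf-8'))
r = urllib2.urlopen("https://en.wikipedia.org/wiki/" + url_path)
# Take the character encoding from the Content-Type header, if present.
_, params = cgi.parse_header(r.headers.get('Content-Type', ''))
encoding = params.get('charset')
content = r.read()
# Decode the bytes to Unicode text, falling back to UTF-8.
unicode_text = content.decode(encoding or 'utf-8')
print unicode_text  # if this fails, set PYTHONIOENCODING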
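
The code above targets Python 2. For readers on Python 3, a rough equivalent might look like the sketch below. The function names (build_url, decode_body, fetch_text, save_entry), the User-Agent string and the tab-separated file layout are all assumptions made for illustration; they do not come from the original answer. The save_entry helper addresses the question's final step of writing the title and URL to a .txt file.

```python
import urllib.parse
import urllib.request


def build_url(wiki_title):
    """Percent-encode the title (as UTF-8) and append it to the wiki base URL."""
    return "https://en.wikipedia.org/wiki/" + urllib.parse.quote(wiki_title)


def decode_body(raw_bytes, charset=None):
    """Decode response bytes, falling back to UTF-8 when no charset is given."""
    return raw_bytes.decode(charset or "utf-8")


def fetch_text(url):
    """Download url and return its body as str (requires network access)."""
    # The User-Agent value is a placeholder assumption for this sketch.
    request = urllib.request.Request(
        url, headers={"User-Agent": "example-scraper/0.1"}
    )
    with urllib.request.urlopen(request) as response:
        # Charset from the Content-Type header, or None if it is absent.
        charset = response.headers.get_content_charset()
        return decode_body(response.read(), charset)


def save_entry(path, title, url):
    """Append one 'title<TAB>url' line to a UTF-8 encoded text file."""
    with open(path, "a", encoding="utf-8") as out:
        out.write(title + "\t" + url + "\n")
```

With these helpers, url = build_url("Stanislav_Šesták") followed by save_entry("entries.txt", "Stanislav Šesták", url) would append one line to entries.txt, and fetch_text(url) would return the page as text. Opening the file with an explicit encoding="utf-8" keeps the output independent of the platform's default encoding.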

