
Unicode issue with Python scraper

I've been writing bad perl for a while, but am attempting to learn to write bad python instead. I've read around the problem I've been having for a couple of days now (and know an awful lot more about unicode as a result) but I'm still having trouble with a rogue em-dash in the following code:

import urllib2

def scrape(url):
    # simplified
    data = urllib2.urlopen(url)
    return data.read()

def query_graph_api(url_list):
    # query Facebook's Graph API, store data.
    for url in url_list:
        graph_query = graph_query_root + "%22" + url + "%22"
        query_data = scrape(graph_query)
        print query_data #debug console

### START HERE ####

graph_query_root = "https://graph.facebook.com/fql?q=SELECT%20normalized_url,share_count,like_count,comment_count,total_count%20FROM%20link_stat%20WHERE%20url="

url_list = ['http://www.supersavvyme.co.uk',  'http://www.supersavvyme.co.uk/article/how-to-be-happy–laugh-more']

query_graph_api(url_list)

(This is a much simplified representation of the scraper, BTW. The original uses a site's sitemap.xml to build a list of URLs, then queries Facebook's Graph API for information on each -- here's the original scraper.)

My attempts to debug this have consisted mostly of trying to emulate the infinite monkeys who are rewriting Shakespeare. My usual method (search StackOverflow for the error message, copy-and-paste the solution) has failed.

Question: how do I encode my data so that extended characters like the em-dash in the second URL won't break my code, but will still work in the FQL query?

PS I'm even wondering whether I'm asking the right question: might urllib.urlencode help me out here? (Certainly it would make that graph_query_root easier and prettier to create...)
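Percent-encoding the URL before splicing it into the query is indeed one way out. A minimal sketch (Python 3 syntax shown; in Python 2 the same function lives in the urllib module as urllib.quote):

```python
# Sketch: percent-encode the URL so only ASCII reaches the request line.
# urllib.parse.quote encodes the string as UTF-8 first, then percent-encodes
# each non-safe byte, so the dash (U+2013) becomes %E2%80%93.
from urllib.parse import quote

url = 'http://www.supersavvyme.co.uk/article/how-to-be-happy\u2013laugh-more'

# keep ':' and '/' unescaped so the URL structure survives
encoded = quote(url, safe=':/')
print(encoded)
# http://www.supersavvyme.co.uk/article/how-to-be-happy%E2%80%93laugh-more
```

The result is pure ASCII, so it can be concatenated into the FQL query string without tripping the codec.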

---8<----

The traceback I get from the actual scraper on ScraperWiki is as follows:

http://www.supersavvyme.co.uk/article/how-to-be-happy–laugh-more
Line 80 - query_graph_api(urls)
Line 53 - query_data = scrape(graph_query) -- query_graph_api((urls=['http://www.supersavvyme.co.uk', 'http://...more
Line 21 - data = urllib2.urlopen(unicode(url)) -- scrape((url=u'https://graph.facebook.com/fql?q=SELECT%20url,...more
/usr/lib/python2.7/urllib2.py:126 -- urlopen((url=u'https://graph.facebook.com/fql?q=SELECT%20url,no...more
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in position 177: ordinal not in range(128)
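That error is easy to reproduce in isolation (a minimal sketch in Python 3 syntax; in Python 2 the string literal would be u'...'): the ASCII codec simply has no byte for U+2013, while UTF-8 represents it as three bytes.

```python
# Sketch: why urlopen blows up -- encoding U+2013 with the ASCII codec fails.
s = 'how-to-be-happy\u2013laugh-more'

try:
    s.encode('ascii')
except UnicodeEncodeError as err:
    print(err.reason)   # ordinal not in range(128)

# UTF-8 can represent it: U+2013 becomes the bytes E2 80 93
print(s.encode('utf-8'))
```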

If you are using Python 3.x, all you have to do is add one line and change another:

gq = graph_query.encode('utf-8')
query_data = scrape(gq)

If you are using Python 2.x, first put the following line in at the top of the module file:

# -*- coding: utf-8 -*- (read what this is for here)

and then make all your string literals unicode and encode just before passing to urlopen:

def scrape(url):
    # simplified
    data = urllib2.urlopen(url)
    return data.read()

def query_graph_api(url_list):
    # query Facebook's Graph API, store data.
    for url in url_list:
        graph_query = graph_query_root + u"%22" + url + u"%22"
        gq = graph_query.encode('utf-8')
        query_data = scrape(gq)
        print query_data #debug console

### START HERE ####

graph_query_root = u"https://graph.facebook.com/fql?q=SELECT%20normalized_url,share_count,like_count,comment_count,total_count%20FROM%20link_stat%20WHERE%20url="

url_list = [u'http://www.supersavvyme.co.uk', u'http://www.supersavvyme.co.uk/article/how-to-be-happy–laugh-more']

query_graph_api(url_list)

It looks from the code like you are using 3.x, which is really better for dealing with stuff like this. But you still have to encode when necessary. In 2.x, the best advice is to do what 3.x does by default: use unicode throughout your code, and only encode when bytes are called for.
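That pattern in one place (a sketch in Python 3 syntax; the function and variable names are illustrative, not from the original scraper): keep everything as unicode text, and encode once at the boundary where bytes are actually needed.

```python
# Sketch of "unicode throughout, encode at the boundary".
def build_query(root, url):
    # pure text manipulation: str in, str out
    return root + '%22' + url + '%22'

root = 'https://graph.facebook.com/fql?q=SELECT%20total_count%20FROM%20link_stat%20WHERE%20url='
query = build_query(root, 'http://www.supersavvyme.co.uk/article/how-to-be-happy\u2013laugh-more')

# only here, just before the network call, do bytes enter the picture
wire = query.encode('utf-8')
print(type(query).__name__, type(wire).__name__)  # str bytes
```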


 