Scrapy Python: unicode link error

Question

Link encoding

When scraping a site, Scrapy extracts links containing &amd and throws an exception: "Do not instantiate Link objects with unicode urls. Assuming utf-8 encoding (which could be wrong)". How can I fix this error?

Answer 1

I had the same problem with the character → inserted in some links. I found a related commit on GitHub and then used its advice to write a file link_extractors.py with:

from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.utils.response import get_base_url


class CustomLinkExtractor(SgmlLinkExtractor):
    """Need this to fix the encoding error."""

    def extract_links(self, response):
        base_url = None
        if self.restrict_xpaths:
            hxs = HtmlXPathSelector(response)
            base_url = get_base_url(response)
            body = u''.join(f for x in self.restrict_xpaths
                           for f in hxs.select(x).extract())
            try:
                body = body.encode(response.encoding)
            except UnicodeEncodeError:
                body = body.encode('utf-8')
        else:
            body = response.body

        links = self._extract_links(body, response.url, response.encoding, base_url)
        links = self._process_links(links)
        return links

Afterwards I used it in my spiders.py:

rules = (
    Rule(CustomLinkExtractor(allow=('/gp/offer-listing*', ),
                             restrict_xpaths=("//li[contains(@class,'a-last')]/a", )),
         callback='parse_start_url', follow=True),
)
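
The key step in the class above is encoding the extracted text back to bytes before links are built, trying the response encoding first and falling back to UTF-8 when a character (such as →) cannot be represented. A minimal standalone sketch of just that step, written for Python 3 and without Scrapy; the helper name encode_with_fallback and the sample strings are assumptions for illustration, not part of Scrapy or of the original answer:

```python
def encode_with_fallback(text, encoding, fallback="utf-8"):
    """Encode text to bytes with `encoding`; use `fallback` if that fails.

    Hypothetical helper mirroring the try/except in extract_links above.
    """
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        # The declared encoding cannot represent some character,
        # so fall back to UTF-8, which can encode any Unicode text.
        return text.encode(fallback)


# "é" exists in latin-1, so the declared encoding is used.
plain = encode_with_fallback("caf\u00e9", "latin-1")

# "→" (U+2192) does not exist in latin-1, so UTF-8 is used instead.
arrow = encode_with_fallback("next \u2192", "latin-1")
```

The fallback only guarantees that encoding succeeds. If the page was not actually UTF-8, the resulting bytes may still not match what the server expects, which is what the "could be wrong" part of the warning refers to.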

Answer 1, posted 2013-11-05 20:46:04
