Scrapy python: unicode links error

link encoding

When scraping a site, Scrapy extracts links containing &amd and throws an exception: "Do not instantiate Link objects with unicode urls. Assuming utf-8 encoding (which could be wrong)". How can I fix this error?

I had the same problem with this character being inserted into some links. I found this related commit on GitHub and then used its advice to write a file link_extractors.py with:

# Note: written against an older Scrapy release (scrapy.contrib namespace).
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.utils.response import get_base_url


class CustomLinkExtractor(SgmlLinkExtractor):
    """Need this to fix the encoding error."""

    def extract_links(self, response):
        base_url = None
        if self.restrict_xpaths:
            hxs = HtmlXPathSelector(response)
            base_url = get_base_url(response)
            # Join the HTML fragments selected by restrict_xpaths (unicode).
            body = u''.join(f for x in self.restrict_xpaths
                            for f in hxs.select(x).extract())
            # Encode back to bytes; fall back to UTF-8 if the response
            # encoding cannot represent every character.
            try:
                body = body.encode(response.encoding)
            except UnicodeEncodeError:
                body = body.encode('utf-8')
        else:
            # No XPath restriction: use the raw response body as-is.
            body = response.body

        links = self._extract_links(body, response.url, response.encoding, base_url)
        links = self._process_links(links)
        return links
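
The core of the extractor above is the encode-with-fallback step: the joined unicode fragments are encoded with the response's declared encoding, and UTF-8 is used only when that fails. Below is a minimal standalone sketch of just that step, without Scrapy. The helper name encode_with_fallback and the sample strings are made up for illustration and are not part of Scrapy.

```python
def encode_with_fallback(text, encoding):
    """Encode text with the declared encoding, falling back to UTF-8."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        # The declared encoding cannot represent some character.
        return text.encode('utf-8')


# 'e' with an acute accent exists in latin-1, so the declared encoding is used.
latin = encode_with_fallback(u'caf\xe9', 'latin-1')

# The euro sign does not exist in latin-1, so UTF-8 is used instead.
fallback = encode_with_fallback(u'\u20ac5', 'latin-1')
```

Note that the fallback can leave the body in a different encoding than response.encoding, which is still passed to _extract_links, so this is a workaround rather than a guaranteed fix for every page.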

Afterwards I used it in my spiders.py:

rules = (
    Rule(CustomLinkExtractor(allow=('/gp/offer-listing*', ),
                             restrict_xpaths=("//li[contains(@class,'a-last')]/a", )),
         callback='parse_start_url', follow=True),
)
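
One detail worth checking in the rule above: the entries in allow are regular expressions searched against each URL, not shell globs, so the trailing * applies only to the final g. The sketch below uses plain re.search to show what the pattern accepts. The example.com URLs are hypothetical and only serve as test inputs.

```python
import re

# Same pattern as in the rule above.
ALLOW = r'/gp/offer-listing*'

urls = [
    'http://example.com/gp/offer-listing/B000TEST',  # hypothetical URL
    'http://example.com/gp/offer-listin',            # also matches: zero 'g's
    'http://example.com/gp/product/B000TEST',        # does not match
]

# Keep the URLs in which the pattern is found anywhere.
matched = [u for u in urls if re.search(ALLOW, u)]
```

If only the offer-listing pages are wanted, a pattern such as '/gp/offer-listing/' would probably be stricter, though the original pattern works for the answer's purpose.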
