
Scrapy - can't follow links due to encoding

I'm trying to scrape some data from allabolag.se. I want to follow the links on pages such as http://www.allabolag.se/5565794400/befattningar, but Scrapy does not extract the links correctly. The extracted URL lacks the "52" right after "%2", so "%252C" becomes "%2C".

For example, I want to go to: http://www.allabolag.se/befattningshavare/de_Sauvage-Nolting%252C_Henri_Jacob_Jan/f6da68933af6383498691f19de7ebd4b

But Scrapy requests the following link instead: http://www.allabolag.se/befattningshavare/de_Sauvage-Nolting%2C_Henri_Jacob_Jan/f6da68933af6383498691f19de7ebd4b

I read on this site that it has something to do with encoding: https://www.owasp.org/index.php/Double_Encoding

How do I get around this?

My code is as follows:

# -*- coding: utf-8 -*-

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request
from allabolag.items import AllabolagItem
from scrapy.loader.processors import Join


class allabolagspider(CrawlSpider):
    name="allabolagspider"
    # allowed_domains = ["byralistan.se"]
    start_urls = [
        "http://www.allabolag.se/5565794400/befattningar"
    ]

    rules = (
        Rule(LinkExtractor(allow = "http://www.allabolag.se", restrict_xpaths=('//*[@id="printContent"]//a[1]')), callback='parse_link'),
    )

    def parse_link(self, response):
        for sel in response.xpath('//*[@id="printContent"]'):
            item = AllabolagItem()
            item['Byra'] = sel.xpath('/div[2]/table/tbody/tr[3]/td/h1').extract()
            item['Namn'] = sel.xpath('/div[2]/table/tbody/tr[3]/td/h1').extract()
            item['Gender'] = sel.xpath('/div[2]/table/tbody/tr[3]/td/h1').extract()
            item['Alder'] = sel.xpath('/div[2]/table/tbody/tr[3]/td/h1').extract()
            yield item

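You can configure the link extractor not to canonicalize URLs by passing canonicalize=False. In the session below, the default extractor rewrites "%252C" to "%2C" and the resulting URL returns a 404; with canonicalize=False the link is kept as it appears in the page and returns a 200.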

Scrapy shell session:

$ scrapy shell http://www.allabolag.se/5565794400/befattningar
>>> from scrapy.linkextractors import LinkExtractor
>>> le = LinkExtractor()
>>> for l in le.extract_links(response):
...     print l
...
(...stripped...)
Link(url='http://www.allabolag.se/befattningshavare/de_Sauvage-Nolting%2C_Henri_Jacob_Jan_Personprofil/f6da68933af6383498691f19de7ebd4b', text=u'', fragment='', nofollow=False)
(...stripped...)
>>> fetch('http://www.allabolag.se/befattningshavare/de_Sauvage-Nolting%2C_Henri_Jacob_Jan_Personprofil/f6da68933af6383498691f19de7ebd4b')
2016-03-02 11:48:07 [scrapy] DEBUG: Crawled (404) <GET http://www.allabolag.se/befattningshavare/de_Sauvage-Nolting%2C_Henri_Jacob_Jan_Personprofil/f6da68933af6383498691f19de7ebd4b> (referer: None)
>>> 

>>> le = LinkExtractor(canonicalize=False)
>>> for l in le.extract_links(response):
...     print l
... 
(...stripped...)
Link(url='http://www.allabolag.se/befattningshavare/de_Sauvage-Nolting%252C_Henri_Jacob_Jan_Personprofil/f6da68933af6383498691f19de7ebd4b', text=u'', fragment='', nofollow=False)
(...stripped...)
>>> 
>>> fetch('http://www.allabolag.se/befattningshavare/de_Sauvage-Nolting%252C_Henri_Jacob_Jan_Personprofil/f6da68933af6383498691f19de7ebd4b')
2016-03-02 11:47:42 [scrapy] DEBUG: Crawled (200) <GET http://www.allabolag.se/befattningshavare/de_Sauvage-Nolting%252C_Henri_Jacob_Jan_Personprofil/f6da68933af6383498691f19de7ebd4b> (referer: None)

So you should be good with:

class allabolagspider(CrawlSpider):
    name="allabolagspider"
    # allowed_domains = ["byralistan.se"]
    start_urls = [
        "http://www.allabolag.se/5565794400/befattningar"
    ]

    rules = (
        Rule(LinkExtractor(allow = "http://www.allabolag.se",
                           restrict_xpaths=('//*[@id="printContent"]//a[1]'),
                           canonicalize=False),
             callback='parse_link'),
    )
    ...
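
To see why the two URLs in the shell session are not interchangeable, here is a minimal sketch using only Python 3's standard urllib.parse (not Scrapy itself). It shows that one round of percent-decoding turns "%252C" into "%2C", which is the form the server answered with a 404. The variable names are illustrative, and this only demonstrates the double-encoding effect; it is not a reproduction of what the link extractor does internally.

```python
from urllib.parse import quote, unquote

# Path segment as it appears in the URL from the question (double-encoded comma).
original = "de_Sauvage-Nolting%252C_Henri_Jacob_Jan"

# One round of percent-decoding turns "%25" into a literal "%",
# leaving "%2C": the form that returned a 404 in the session above.
decoded_once = unquote(original)
print(decoded_once)

# A second round turns "%2C" into an actual comma.
decoded_twice = unquote(decoded_once)
print(decoded_twice)

# Percent-encoding the once-decoded form again restores the original segment.
reencoded = quote(decoded_once, safe="")
print(reencoded == original)
```

Since the site evidently expects the double-encoded form, any step that decodes the link once before requesting it produces a different resource path, which is why leaving the extracted URL untouched with canonicalize=False helps here.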
