

Sequential scraping from multiple start_urls leading to error in parsing

First, my highest appreciation for all the work you put into answering noob questions like this one.

Second, since this seems to be a fairly common problem, I found (IMO) related questions such as: Scrapy: Wait for a specific url to be parsed before parsing others

However, at my current level of understanding it is not straightforward to adapt those suggestions to my specific case, and I would really appreciate your help.

Problem outline (running on Python 3.7.1, Scrapy 1.5.1):

I want to scrape data from every link collected on pages like this: https://www.gipfelbuch.ch/gipfelbuch/touren/seite/1

and then from all links on another listing page:

https://www.gipfelbuch.ch/gipfelbuch/touren/seite/650

I manage to get the desired information (only two elements shown here) if I run the spider for one page (e.g. page 1 or page 650) at a time. (Note that I restricted the number of links crawled per page to 2.) However, once I have multiple start_urls (setting two elements in the list [1,650] in the code below), the parsed data is no longer consistent. Apparently at least one element is not found by the xpath. I suspect some (or a lot of) incorrect logic in how I handle/pass the requests, which means parsing does not happen in the intended order.

Code:

import scrapy
from scrapy.spiders import CrawlSpider


class SlfSpider1Spider(CrawlSpider):
    name = 'slf_spider1'
    custom_settings = { 'CONCURRENT_REQUESTS': '1' }
    allowed_domains = ['gipfelbuch.ch']
    start_urls = ['https://www.gipfelbuch.ch/gipfelbuch/touren/seite/'+str(i) for i in [1,650]]

    # Method which starts the requests by visiting all URLs specified in start_urls
    def start_requests(self):
        for url in self.start_urls:
            print('#### START REQUESTS: ',url)
            yield scrapy.Request(url, callback=self.parse_verhaeltnisse, dont_filter=True)

    # Collect the detail-page links from a listing page and request them
    def parse_verhaeltnisse(self,response):
        links = response.xpath('//td//@href').extract()
        for link in links[0:2]:
            print('##### PARSING: ',link)
            abs_link = 'https://www.gipfelbuch.ch/'+link
            yield scrapy.Request(abs_link, callback=self.parse_gipfelbuch_item, dont_filter=True)

    # Extract label/value pairs from a detail page
    def parse_gipfelbuch_item(self, response):
        route = response.xpath('/html/body/main/div[4]/div[@class="col_f"]//div[@class="togglebox cont_item mt"]//div[@class="label_container"]')

        print('#### PARSER OUTPUT: ')

        key=[route[i].xpath('string(./label)').extract()[0] for i in range(len(route))]
        value=[route[i].xpath('string(div[@class="label_content"])').extract()[0] for i in range(len(route))]
        fields = dict(zip(key,value))

        print('Route: ', fields['Gipfelname'])
        print('Comments: ', fields['Verhältnis-Beschreibung'])

        print('Length of dict extracted from Route: {}'.format(len(route)))
        return

Command prompt output:

2019-03-18 15:42:27 [scrapy.core.engine] INFO: Spider opened
2019-03-18 15:42:27 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-03-18 15:42:27 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6024
#### START REQUESTS:  https://www.gipfelbuch.ch/gipfelbuch/touren/seite/1
2019-03-18 15:42:28 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.gipfelbuch.ch/gipfelbuch/touren/seite/1> (referer: None)
#### START REQUESTS:  https://www.gipfelbuch.ch/gipfelbuch/touren/seite/650
##### PARSING:  /gipfelbuch/detail/id/101559/Skitour_Snowboardtour/Beaufort
##### PARSING:  /gipfelbuch/detail/id/101557/Skitour_Snowboardtour/Blinnenhorn
2019-03-18 15:42:30 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.gipfelbuch.ch/gipfelbuch/touren/seite/650> (referer: None)
##### PARSING:  /gipfelbuch/detail/id/69022/Alpine_Wanderung/Schwaendeliflue
##### PARSING:  /gipfelbuch/detail/id/69021/Schneeschuhtour/Cima_Portule

2019-03-18 15:42:32 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.gipfelbuch.ch//gipfelbuch/detail/id/101557/Skitour_Snowboardtour/Blinnenhorn> (referer: https://www.gipfelbuch.ch/gipfelbuch/touren/seite/1)
#### PARSER OUTPUT:
Route:  Blinnenhorn/Corno Cieco
Comments:  Am Samstag Aufstieg zur Corno Gries Hütte, ca. 2,5h ab All Acqua. Zustieg problemslos auf guter Spur. Zur Verwunderung waren wir die einzigsten auf der Hütte. Danke an Monika für die herzliche Bewirtung...
Length of dict extracted from Route: 27

2019-03-18 15:42:34 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.gipfelbuch.ch//gipfelbuch/detail/id/69021/Schneeschuhtour/Cima_Portule> (referer: https://www.gipfelbuch.ch/gipfelbuch/touren/seite/650)
#### PARSER OUTPUT:
Route:  Cima Portule
Comments:  Sehr viel Schnee in dieser Gegend und viel Spirarbeit geleiset, deshalb auch viel Zeit gebraucht.
Length of dict extracted from Route: 19

2019-03-18 15:42:38 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.gipfelbuch.ch//gipfelbuch/detail/id/69022/Alpine_Wanderung/Schwaendeliflue> (referer: https://www.gipfelbuch.ch/gipfelbuch/touren/seite/650)
#### PARSER OUTPUT:
Route:  Schwändeliflue
Comments:  Wege und Pfade meist schneefrei, da im Gebiet viel Hochmoor ist, z.t. sumpfig.  Oberhalb 1600m und in Schattenlagen bis 1400m etwas Schnee  (max.Schuhtief).  Wetter sonnig und sehr warm für die Jahreszeit, T-Shirt - Wetter,  Frühlingshaft....
Length of dict extracted from Route: 17

2019-03-18 15:42:40 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.gipfelbuch.ch//gipfelbuch/detail/id/101559/Skitour_Snowboardtour/Beaufort> (referer: https://www.gipfelbuch.ch/gipfelbuch/touren/seite/1)
#### PARSER OUTPUT:
Route:  Beaufort
2019-03-18 15:42:40 [scrapy.core.scraper] ERROR: Spider error processing <GET https://www.gipfelbuch.ch//gipfelbuch/detail/id/101559/Skitour_Snowboardtour/Beaufort> (referer: https://www.gipfelbuch.ch/gipfelbuch/touren/seite/1)
Traceback (most recent call last):
  File "C:\Users\Lenovo\Anaconda3\lib\site-packages\twisted\internet\defer.py", line 654, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "C:\Users\Lenovo\Dropbox\Code\avalanche\scrapy\slf1\slf1\spiders\slf_spider1.py", line 38, in parse_gipfelbuch_item
    print('Comments: ', fields['Verhältnis-Beschreibung'])
KeyError: 'Verhältnis-Beschreibung'
2019-03-18 15:42:40 [scrapy.core.engine] INFO: Closing spider (finished)
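
As a side note, I guess the crash itself could be avoided with a more defensive lookup (a quick sketch using dict.get with a placeholder default), but that would only hide the missing field rather than explain it:

# Hypothetical defensive lookup: falls back to a placeholder instead of raising KeyError
print('Route: ', fields.get('Gipfelname', 'N/A'))
print('Comments: ', fields.get('Verhältnis-Beschreibung', 'N/A'))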

Question: How do I have to structure the first (for links) and second (for content) parse callbacks correctly? Why is the "PARSER OUTPUT" not in the order I would expect (first page 1, links top to bottom, then page 650, links top to bottom)?

I already tried reducing CONCURRENT_REQUESTS to 1 and setting DOWNLOAD_DELAY = 2.
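
For reference, a minimal sketch of how I set these (assuming they go into the spider's custom_settings rather than settings.py):

custom_settings = {
    'CONCURRENT_REQUESTS': 1,  # only one request in flight at a time
    'DOWNLOAD_DELAY': 2,       # wait two seconds between downloads
}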

I hope the question is clear enough... big thanks in advance.

If the problem is that multiple URLs are visited at the same time, you can visit them one by one using the spider_idle signal ( https://docs.scrapy.org/en/latest/topics/signals.html ).

The idea is the following:

1. start_requests only visits the first URL
2. when the spider gets idle, the spider_idle method is called
3. the spider_idle method deletes the first URL and visits the second URL
4. and so on...

The code would be something like this (I didn't try it):

import scrapy
from scrapy import signals
from scrapy.http import Request
from scrapy.spiders import CrawlSpider


class SlfSpider1Spider(CrawlSpider):
    name = 'slf_spider1'
    custom_settings = { 'CONCURRENT_REQUESTS': '1' }
    allowed_domains = ['gipfelbuch.ch']
    start_urls = ['https://www.gipfelbuch.ch/gipfelbuch/touren/seite/'+str(i) for i in [1,650]]

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(SlfSpider1Spider, cls).from_crawler(crawler, *args, **kwargs)
        # Here you set which method the spider has to run when it gets idle
        crawler.signals.connect(spider.spider_idle, signal=signals.spider_idle)
        return spider

    # Method which starts the requests by visiting only the first URL in start_urls
    def start_requests(self):
        # the spider visits only the first provided URL
        url = self.start_urls[0]
        print('#### START REQUESTS: ',url)
        yield scrapy.Request(url, callback=self.parse_verhaeltnisse, dont_filter=True)

    def parse_verhaeltnisse(self,response):
        links = response.xpath('//td//@href').extract()
        for link in links[0:2]:
            print('##### PARSING: ',link)
            abs_link = 'https://www.gipfelbuch.ch/'+link
            yield scrapy.Request(abs_link, callback=self.parse_gipfelbuch_item, dont_filter=True)


    def parse_gipfelbuch_item(self, response):
        route = response.xpath('/html/body/main/div[4]/div[@class="col_f"]//div[@class="togglebox cont_item mt"]//div[@class="label_container"]')

        print('#### PARSER OUTPUT: ')

        key=[route[i].xpath('string(./label)').extract()[0] for i in range(len(route))]
        value=[route[i].xpath('string(div[@class="label_content"])').extract()[0] for i in range(len(route))]
        fields = dict(zip(key,value))

        print('Route: ', fields['Gipfelname'])
        print('Comments: ', fields['Verhältnis-Beschreibung'])

        print('Length of dict extracted from Route: {}'.format(len(route)))
        return

    # When the spider gets idle, it deletes the first url and visits the second, and so on...
    def spider_idle(self, spider):
        del(self.start_urls[0])
        if len(self.start_urls)>0:
            url = self.start_urls[0]
            self.crawler.engine.crawl(Request(url, callback=self.parse_verhaeltnisse, dont_filter=True), spider)
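
The reason this should enforce the order is that spider_idle is only sent once there are no requests left waiting to be downloaded, scheduled, or processed, so (if I understand the signal correctly) the second listing page is only requested after every detail page from the first listing has been parsed.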
