
How to Import URLs From Spider to Spider?

I am building a Scrapy spider, WuzzufLinks, that scrapes all the links to specific jobs on a job website from this page: https://wuzzuf.net/search/jobs/?filters%5Bcountry%5D%5B0%5D=Egypt

After scraping the links, I would like to send them to another spider, WuzzufSpider, which scrapes data from inside each link. The start_urls would be the first link in the scraped list, next_page would be the following link, and so on.

I have thought of importing WuzzufLinks into WuzzufSpider and then accessing its data:

[screenshots: the WuzzufLinks spider and the attempted import in WuzzufSpider]

Regardless of whether I have written the outlined parts correctly, I have realized that accessing jobURL would return an empty value, since it is only a temporary container. I have thought of saving the scraped links in another file and then importing them into WuzzufSpider, but I don't know whether the import is valid and whether they will still be a list:

[screenshots: saving the scraped links to a file and importing them into WuzzufSpider]

Is there a way to make the second method work, or is there a completely different approach?

I have checked the threads Scrapy: Pass data between 2 spiders and Pass scraped URLs from one spider to another. I understand that I can do all of the work in one spider, and that there is a way to save to a database or temporary file in order to send data to another spider. However, I am not yet very experienced and don't understand how to implement such changes, so marking this question as a duplicate won't help me. Thank you for your help.

First of all, you can keep crawling the URLs from the same spider, and honestly I don't see a reason not to.
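For example, a single spider can collect the job links in parse and hand each one to a second callback that scrapes the details. This is a minimal sketch of that idea; the XPath expressions are placeholders that you would replace with the real selectors for the Wuzzuf page:

import scrapy


class WuzzufJobsSpider(scrapy.Spider):
    name = "wuzzufJobs"
    start_urls = [
        "https://wuzzuf.net/search/jobs/?filters%5Bcountry%5D%5B0%5D=Egypt"
    ]

    def parse(self, response):
        # placeholder XPath: collect the links to the individual job pages
        for href in response.xpath('//div[@class="job-card"]//a/@href').getall():
            yield response.follow(href, callback=self.parse_job)

        # placeholder XPath: follow the pagination link, if there is one
        next_page = response.xpath('//a[@rel="next"]/@href').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

    def parse_job(self, response):
        # scrape whatever you need from inside each job page
        yield {
            'title': response.xpath('//h1/text()').get(),
            'url': response.url,
        }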

Anyway, if you really want to have two spiders, where the output of the first becomes the input of the second, you can do something like this:

import scrapy
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings
from scrapy.signalmanager import dispatcher
from scrapy import signals
from twisted.internet import reactor, defer


# grab all the products urls
class ExampleSpider(scrapy.Spider):
    name = "exampleSpider"
    start_urls = ['https://scrapingclub.com/exercise/list_basic']

    def parse(self, response):
        all_urls = response.xpath('//div[@class="card"]/a/@href').getall()
        for url in all_urls:
            yield {'url': 'https://scrapingclub.com' + url}


# get the product's details
# (start_urls is not defined here; it is passed in when the spider is crawled)
class ExampleSpider2(scrapy.Spider):
    name = "exampleSpider2"

    def parse(self, response):
        title = response.xpath('//h3/text()').get()
        price = response.xpath('//div[@class="card-body"]//h4//text()').get()
        yield {
            'title': title,
            'price': price
        }


if __name__ == "__main__":
    # this will be the yielded items from the first spider
    output = []

    def get_output(item):
        output.append(item)

    configure_logging()
    settings = get_project_settings()
    settings['USER_AGENT'] = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'
    runner = CrawlerRunner(settings)

    # run spiders sequentially
    # (https://docs.scrapy.org/en/latest/topics/practices.html#running-multiple-spiders-in-the-same-process)
    @defer.inlineCallbacks
    def crawl():
        # collect every scraped item into the output list
        dispatcher.connect(get_output, signal=signals.item_scraped)
        yield runner.crawl('exampleSpider')
        urls = [url['url'] for url in output]   # create a list of the urls from the first spider

        # crawl the second spider with the urls from the first spider
        yield runner.crawl('exampleSpider2', start_urls=urls)
        reactor.stop()

    crawl()
    reactor.run()

Run this and you will see that you first get the results from the first spider, and that those results are then passed as the start_urls of the second spider.
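If you would rather make your second method work, i.e. save the scraped links to a file and read them back, you can export the first spider's items with Scrapy's feed exports and load the file in the second spider's start_requests. A minimal sketch, assuming the first spider was run with scrapy crawl exampleSpider -O links.json (the -O flag requires Scrapy 2.1+; use -o on older versions):

import json

import scrapy


class ExampleSpider2(scrapy.Spider):
    name = "exampleSpider2"

    def start_requests(self):
        # links.json is the feed exported by the first spider:
        # a JSON list of {"url": ...} dicts
        with open('links.json') as f:
            items = json.load(f)
        for item in items:
            yield scrapy.Request(item['url'], callback=self.parse)

    def parse(self, response):
        title = response.xpath('//h3/text()').get()
        price = response.xpath('//div[@class="card-body"]//h4//text()').get()
        yield {
            'title': title,
            'price': price
        }

Either way the links arrive as a proper Python list, so you don't need to worry about losing the list structure between the two runs.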
