
Scrapy Spider stops before crawling anything

So I have a Django project with a views.py from which I want to call a Scrapy spider when a certain condition is met. The crawler seems to be invoked just fine, but it terminates so quickly that the parse function is never called (that is my assumption, at least), as shown in the log below:

2020-11-16 18:51:25 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'products',
 'NEWSPIDER_MODULE': 'crawler.spiders',
 'SPIDER_MODULES': ['crawler.spiders.my_spider'],
 'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, '
               'like Gecko) Chrome/80.0.3987.149 Safari/537.36'}
2020-11-16 18:51:25 [scrapy.extensions.telnet] INFO: Telnet Password: ******
2020-11-16 18:51:25 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
['https://www.tesco.com/groceries/en-GB/products/307358055']
2020-11-16 18:51:26 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-11-16 18:51:26 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
[16/Nov/2020 18:51:26] "POST /productsinfo HTTP/1.1" 200 2

views.py

def get_info():
    url = data[product]["url"]
    setup()
    runner(url)
    products = []
    serializer = ProductSerializer(products, many=True)
    return Response(serializer.data)

@wait_for(timeout=10.0)
def runner(url):
    crawler_settings = Settings()
    configure_logging()
    crawler_settings.setmodule(my_settings)
    runner = CrawlerRunner(settings=crawler_settings)
    d = runner.crawl(MySpider, url=url)
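
For reference, setup() and wait_for appear to come from the crochet package (the imports are not shown above, so this is an assumption). My understanding of how that pattern is usually wired up is roughly the sketch below; wait_for only blocks on the Deferred that the decorated function returns, and my_settings and MySpider are assumed to be the same objects imported elsewhere in the project:

from crochet import setup, wait_for
from scrapy.crawler import CrawlerRunner
from scrapy.settings import Settings
from scrapy.utils.log import configure_logging

setup()  # start the Twisted reactor in a background thread (call once per process)

@wait_for(timeout=10.0)
def run_crawl(url):
    crawler_settings = Settings()
    configure_logging()
    crawler_settings.setmodule(my_settings)  # my_settings: the project settings module, as in the question
    runner = CrawlerRunner(settings=crawler_settings)
    # returning the Deferred lets wait_for block the caller until the crawl finishes
    return runner.crawl(MySpider, url=url)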

my_spider.py

import scrapy
from scrapy.loader import ItemLoader
from itemloaders.processors import TakeFirst
from crawler.items import ScraperItem


class MySpider(scrapy.Spider):
    name = "myspider"

    def __init__(self, *args, **kwargs):
        link = kwargs.get('url')
        self.start_urls = [link]
        super().__init__(**kwargs)

    def start_requests(self):
        yield scrapy.Request(url=self.start_urls[0], callback=self.parse)

    def parse(self, response):
        # do stuff
        pass

Can anyone guide me towards why this is happening and how I can solve it?

I'm not sure why that is; however, I remember running into similar issues. Could you please change your __init__ and start_requests methods to the following and report the outcome:

    def __init__(self, *args, **kwargs):
        self.link = kwargs.get('url')
        super().__init__(**kwargs)

    def start_requests(self):
        yield scrapy.Request(url=self.link, callback=self.parse)
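
For completeness, here is a minimal self-contained sketch of the spider with that change applied (the parse body is only a placeholder, not your actual extraction logic):

import scrapy


class MySpider(scrapy.Spider):
    name = "myspider"

    def __init__(self, *args, **kwargs):
        # keep the URL passed in via runner.crawl(MySpider, url=url)
        self.link = kwargs.get('url')
        super().__init__(**kwargs)

    def start_requests(self):
        # issue the first request explicitly from the stored URL
        yield scrapy.Request(url=self.link, callback=self.parse)

    def parse(self, response):
        # placeholder: extract and yield the product fields here
        self.logger.info("Fetched %s", response.url)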
