
scrapy.Request does not execute callback function to process custom URL

I would expect to see "HIT" in my Visual Studio console, but the process_listing function is never executed. When I run scrapy crawl foo -O foo.json I get this error:

start_requests = iter(self.spider.start_requests()) TypeError: 'NoneType' object is not iterable

I already checked here.

import json
import re
import os
import requests
import scrapy
import time
from scrapy.selector import Selector
from scrapy.http import HtmlResponse
import html2text

class FooSpider(scrapy.Spider):
    name = 'foo'
    start_urls = ['https://www.example.com/item.json?lang=en']

    def start_requests(self):
        r = requests.get(self.start_urls[0])
        cont = r.json()
        self.parse(cont)

    def parse(self, response):
        for o in response['objects']:
            if o.get('option') == "buy" and o.get('is_available'):   
                listing_url = "https://www.example.com/" + \
                     o.get('brand').lower().replace(' ','-') + "-" + \
                     o.get('model').lower() + "-"
                if o.get('make') is not None:
                    listing_url += o.get('make') + "-"
                listing_url += o.get('year').lower() 
                print(listing_url) #a valid url is printed here

                yield scrapy.Request(
                    url=response.urljoin(listing_url), 
                    callback=self.process_listing
                )

    
    def process_listing(self, response):
        #this function is never executed
        print('HIT')
        yield item

I tried:

  • url=response.urljoin(listing_url)
  • url=listing_url

Looking at the documentation for scrapy.Spider.start_requests, we see:

This method must return an iterable with the first Requests to crawl for this spider. It is called by Scrapy when the spider is opened for scraping. Scrapy calls it only once, so it is safe to implement start_requests() as a generator.

(emphasis mine)

Your start_requests method doesn't return anything (i.e., it returns None):

    def start_requests(self):
        r = requests.get(self.start_urls[0])
        cont = r.json()
        self.parse(cont)

So when Scrapy calls iter(self.spider.start_requests()), it ends up asking for iter(None), and None isn't iterable.
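You can reproduce the failure with plain Python, no Scrapy needed: a function with no return or yield statement returns None, and passing that to iter() raises exactly the TypeError in the traceback, while a generator function yields an iterable as Scrapy expects.

```python
def start_requests_broken():
    # No return or yield, like the spider's method above: returns None.
    pass

def start_requests_fixed():
    # A yield makes this a generator function, so calling it
    # produces an iterable, which is what Scrapy requires.
    yield "request-1"

try:
    iter(start_requests_broken())
except TypeError as e:
    print(e)  # 'NoneType' object is not iterable

print(list(start_requests_fixed()))  # ['request-1']
```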

