
scrapy.Request 不执行回调函数处理自定义 URL


I would expect to see "HIT" in my Visual Studio console, but the process_listing function is never executed. When I run scrapy crawl foo -O foo.json I get this error:

start_requests = iter(self.spider.start_requests())
TypeError: 'NoneType' object is not iterable

I already checked here.

import json
import re
import os
import requests
import scrapy
import time
from scrapy.selector import Selector
from scrapy.http import HtmlResponse
import html2text

class FooSpider(scrapy.Spider):
    name = 'foo'
    start_urls = ['https://www.example.com/item.json?lang=en']

    def start_requests(self):
        r = requests.get(self.start_urls[0])
        cont = r.json()
        self.parse(cont)

    def parse(self, response):
        for o in response['objects']:
            if o.get('option') == "buy" and o.get('is_available'):   
                listing_url = "https://www.example.com/" + \
                     o.get('brand').lower().replace(' ','-') + "-" + \
                     o.get('model').lower() + "-"
                if o.get('make') is not None:
                    listing_url += o.get('make') + "-"
                listing_url += o.get('year').lower() 
                print(listing_url) #a valid url is printed here

                yield scrapy.Request(
                    url=response.urljoin(listing_url), 
                    callback=self.process_listing
                )

    
    def process_listing(self, response):
        #this function is never executed
        print('HIT')
        yield item

I tried:

  • url=response.urljoin(listing_url)
  • url=listing_url

Looking at the documentation for scrapy.Spider.start_requests, we see:

This method must return an iterable with the first Requests to crawl for this spider. It is called by Scrapy when the spider is opened for scraping. Scrapy calls it only once, so it is safe to implement start_requests() as a generator.

(emphasis mine)

Your start_requests method doesn't return anything (aka it returns None):

    def start_requests(self):
        r = requests.get(self.start_urls[0])
        cont = r.json()
        self.parse(cont)

So when Scrapy calls iter(self.spider.start_requests()), it ends up asking for iter(None), and None isn't iterable. A side effect of the same mistake: self.parse(cont) only creates a generator object (because parse contains yield) and never iterates it, so none of your inner scrapy.Request objects are ever produced either.
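
A minimal fix, sketched below with the question's placeholder example.com URLs: make start_requests a generator that yields a scrapy.Request, and let Scrapy download the JSON itself instead of calling requests.get(). (response.json() requires Scrapy >= 2.2; json.loads(response.text) works on any version.)

import json
import scrapy

class FooSpider(scrapy.Spider):
    name = 'foo'
    start_urls = ['https://www.example.com/item.json?lang=en']

    def start_requests(self):
        # Yielding makes this method a generator, so Scrapy gets an
        # iterable of Requests back instead of None.
        yield scrapy.Request(self.start_urls[0], callback=self.parse)

    def parse(self, response):
        # response is now a Scrapy Response object, not a plain dict
        data = json.loads(response.text)  # or response.json() on Scrapy >= 2.2
        for o in data['objects']:
            if o.get('option') == "buy" and o.get('is_available'):
                listing_url = "https://www.example.com/" + \
                    o.get('brand').lower().replace(' ', '-') + "-" + \
                    o.get('model').lower() + "-"
                if o.get('make') is not None:
                    listing_url += o.get('make') + "-"
                listing_url += o.get('year').lower()
                yield scrapy.Request(url=listing_url, callback=self.process_listing)

    def process_listing(self, response):
        print('HIT')  # should now fire once per listing page

Note that since listing_url is built as an absolute URL, url=listing_url is sufficient; response.urljoin() is only needed for relative paths.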
