yield scrapy.Request does not invoke parse function on each iteration

In my code I have two functions inside the Scrapy class. start_requests takes data from an Excel workbook and assigns each value to the plate_num_xlsx variable.

def start_requests(self):
    df=pd.read_excel('data.xlsx')
    columnA_values=df['PLATE']
    for row in columnA_values:
        global  plate_num_xlsx
        plate_num_xlsx=row
        print("+",plate_num_xlsx)
        base_url =f"https://dvlaregistrations.dvla.gov.uk/search/results.html?search={plate_num_xlsx}&action=index&pricefrom=0&priceto=&prefixmatches=&currentmatches=&limitprefix=&limitcurrent=&limitauction=&searched=true&openoption=&language=en&prefix2=Search&super=&super_pricefrom=&super_priceto="
        url=base_url
        yield scrapy.Request(url,callback=self.parse)

On each iteration it should invoke the parse() method of the Scrapy class, and inside that method the newly assigned value of plate_num_xlsx needs to be compared against the parsed value. As I understand it from the print statements, it first takes all the values and assigns them, and only then calls the parse() method, with just the last value assigned. But for my crawler to work properly, I need parse() to be called with each value as it is assigned. The full code is below:

import scrapy
from scrapy.crawler import CrawlerProcess
import pandas as pd

itemList=[]
class plateScraper(scrapy.Spider):
    name = 'scrapePlate'
    allowed_domains = ['dvlaregistrations.dvla.gov.uk']

    def start_requests(self):
        df=pd.read_excel('data.xlsx')
        columnA_values=df['PLATE']
        for row in columnA_values:
            global  plate_num_xlsx
            plate_num_xlsx=row
            print("+",plate_num_xlsx)
            base_url =f"https://dvlaregistrations.dvla.gov.uk/search/results.html?search={plate_num_xlsx}&action=index&pricefrom=0&priceto=&prefixmatches=&currentmatches=&limitprefix=&limitcurrent=&limitauction=&searched=true&openoption=&language=en&prefix2=Search&super=&super_pricefrom=&super_priceto="
            url=base_url
            yield scrapy.Request(url,callback=self.parse)

    def parse(self, response):

        for row in response.css('div.resultsstrip'):
            plate = row.css('a::text').get()
            price = row.css('p::text').get()
            a = plate.replace(" ", "").strip()
            print(plate_num_xlsx,a,a == plate_num_xlsx)
            if plate_num_xlsx==plate.replace(" ","").strip():
                item= {"plate": plate.strip(), "price": price.strip()}
                itemList.append(item)
                yield  item
            else:
                item = {"plate": plate_num_xlsx, "price": "-"}
                itemList.append(item)
                yield item

        with pd.ExcelWriter('output_res.xlsx', mode='r+',if_sheet_exists='overlay') as writer:
            df_output = pd.DataFrame(itemList)
            df_output.to_excel(writer, sheet_name='result', index=False, header=True)

process = CrawlerProcess()
process.crawl(plateScraper)
process.start()

Using global variables with Scrapy in this manner will not work due to its asynchronous runtime behavior: the loop in start_requests keeps reassigning plate_num_xlsx while the requests it yields are still in flight, so by the time parse() runs, the global typically holds a later (often the last) value. What you can do alternatively is pass the plate_num_xlsx variable as a callback keyword argument to the request object itself.

For example:

            plate_num_xlsx=row
            base_url =f"https://dvlaregistrations.dvla.gov.uk/search/results.html?search={plate_num_xlsx}&action=index&pricefrom=0&priceto=&prefixmatches=&currentmatches=&limitprefix=&limitcurrent=&limitauction=&searched=true&openoption=&language=en&prefix2=Search&super=&super_pricefrom=&super_priceto="
            url=base_url
            yield scrapy.Request(url,callback=self.parse, cb_kwargs={'plate_num_xlsx': plate_num_xlsx})



    def parse(self, response, plate_num_xlsx=None):
        for row in response.css('div.resultsstrip'):
            plate = row.css('a::text').get()
            price = row.css('p::text').get()
            ...

Now the variable will be included as a parameter to the parse function.
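
For completeness, here is a minimal sketch of the whole spider rewritten around cb_kwargs. It is one way to do it, with a few assumptions beyond the original code: the item list lives on the spider instance instead of in a module-level global, the "-" placeholder is yielded once per search page when no row matches (rather than once for every non-matching row), and the results are written a single time in Scrapy's closed() hook after the crawl finishes, instead of rewriting output_res.xlsx inside every parse() call:

import scrapy
from scrapy.crawler import CrawlerProcess
import pandas as pd

class plateScraper(scrapy.Spider):
    name = 'scrapePlate'
    allowed_domains = ['dvlaregistrations.dvla.gov.uk']

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.item_list = []  # per-spider state, no module-level global

    def start_requests(self):
        df = pd.read_excel('data.xlsx')
        for plate_num_xlsx in df['PLATE']:
            url = (f"https://dvlaregistrations.dvla.gov.uk/search/results.html"
                   f"?search={plate_num_xlsx}&action=index&pricefrom=0&priceto="
                   f"&prefixmatches=&currentmatches=&limitprefix=&limitcurrent="
                   f"&limitauction=&searched=true&openoption=&language=en"
                   f"&prefix2=Search&super=&super_pricefrom=&super_priceto=")
            # Bind the current plate to this request; the parse() call for
            # this response will receive exactly the value that produced it.
            yield scrapy.Request(url, callback=self.parse,
                                 cb_kwargs={'plate_num_xlsx': plate_num_xlsx})

    def parse(self, response, plate_num_xlsx=None):
        matched = False
        for row in response.css('div.resultsstrip'):
            plate = row.css('a::text').get()
            price = row.css('p::text').get()
            if plate and plate.replace(" ", "").strip() == plate_num_xlsx:
                matched = True
                item = {"plate": plate.strip(),
                        "price": price.strip() if price else "-"}
                self.item_list.append(item)
                yield item
        if not matched:
            # No listing matched this plate; record a placeholder row.
            item = {"plate": plate_num_xlsx, "price": "-"}
            self.item_list.append(item)
            yield item

    def closed(self, reason):
        # Runs once when the spider finishes: write all collected rows
        # in one go instead of reopening the workbook per response.
        pd.DataFrame(self.item_list).to_excel('output_res.xlsx',
                                              sheet_name='result',
                                              index=False)

process = CrawlerProcess()
process.crawl(plateScraper)
process.start()

Because each plate travels with its own request, the comparison in parse() is always against the value that generated that response, no matter what order the responses arrive in.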
