yield scrapy.Request does not invoke parse function on each iteration
In my code, start_requests runs inside a Scrapy spider class. It reads data from an Excel workbook and assigns each value to the plate_num_xlsx variable.
def start_requests(self):
    df=pd.read_excel('data.xlsx')
    columnA_values=df['PLATE']
    for row in columnA_values:
        global plate_num_xlsx
        plate_num_xlsx=row
        print("+",plate_num_xlsx)
        base_url =f"https://dvlaregistrations.dvla.gov.uk/search/results.html?search={plate_num_xlsx}&action=index&pricefrom=0&priceto=&prefixmatches=&currentmatches=&limitprefix=&limitcurrent=&limitauction=&searched=true&openoption=&language=en&prefix2=Search&super=&super_pricefrom=&super_priceto="
        url=base_url
        yield scrapy.Request(url,callback=self.parse)
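However, on each iteration it should call the spider's parse() method, and inside that function the parsed value needs to be compared with the current iteration's value of plate_num_xlsx. As far as I can tell from the print statements, it first goes through all the values, and parse() is then called with only the last value that was assigned. For my crawler to work correctly, parse() needs to be called with, and use, the value from each assignment. The code is below: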
import scrapy
from scrapy.crawler import CrawlerProcess
import pandas as pd

itemList=[]

class plateScraper(scrapy.Spider):
    name = 'scrapePlate'
    allowed_domains = ['dvlaregistrations.dvla.gov.uk']

    def start_requests(self):
        df=pd.read_excel('data.xlsx')
        columnA_values=df['PLATE']
        for row in columnA_values:
            global plate_num_xlsx
            plate_num_xlsx=row
            print("+",plate_num_xlsx)
            base_url =f"https://dvlaregistrations.dvla.gov.uk/search/results.html?search={plate_num_xlsx}&action=index&pricefrom=0&priceto=&prefixmatches=&currentmatches=&limitprefix=&limitcurrent=&limitauction=&searched=true&openoption=&language=en&prefix2=Search&super=&super_pricefrom=&super_priceto="
            url=base_url
            yield scrapy.Request(url,callback=self.parse)

    def parse(self, response):
        for row in response.css('div.resultsstrip'):
            plate = row.css('a::text').get()
            price = row.css('p::text').get()
            a = plate.replace(" ", "").strip()
            print(plate_num_xlsx,a,a == plate_num_xlsx)
            if plate_num_xlsx==plate.replace(" ","").strip():
                item= {"plate": plate.strip(), "price": price.strip()}
                itemList.append(item)
                yield item
            else:
                item = {"plate": plate_num_xlsx, "price": "-"}
                itemList.append(item)
                yield item
        with pd.ExcelWriter('output_res.xlsx', mode='r+',if_sheet_exists='overlay') as writer:
            df_output = pd.DataFrame(itemList)
            df_output.to_excel(writer, sheet_name='result', index=False, header=True)

process = CrawlerProcess()
process.crawl(plateScraper)
process.start()
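Because Scrapy runs asynchronously, using a global variable this way will not work: by the time a parse() callback runs, the loop in start_requests may already have reassigned the global to a later value. Instead, you can pass plate_num_xlsx to the callback as a keyword argument through the Request object itself, using cb_kwargs.
For example: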
            plate_num_xlsx=row
            base_url =f"https://dvlaregistrations.dvla.gov.uk/search/results.html?search={plate_num_xlsx}&action=index&pricefrom=0&priceto=&prefixmatches=&currentmatches=&limitprefix=&limitcurrent=&limitauction=&searched=true&openoption=&language=en&prefix2=Search&super=&super_pricefrom=&super_priceto="
            url=base_url
            yield scrapy.Request(url,callback=self.parse, cb_kwargs={'plate_num_xlsx': plate_num_xlsx})

    def parse(self, response, plate_num_xlsx=None):
        for row in response.css('div.resultsstrip'):
            plate = row.css('a::text').get()
            price = row.css('p::text').get()
            ...
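The variable is now passed to the parse function as an argument, so each callback sees the value that belonged to its own request.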
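To see why a global ends up holding only the last value while a per-request keyword argument does not, here is a toy simulation that runs without Scrapy. It is only a sketch under simplifying assumptions: FakeRequest, run, and the example.invalid URL are hypothetical stand-ins, and the "queue everything first, call back later" loop is a deliberate simplification of Scrapy's scheduling, not how the engine is actually implemented.

```python
from collections import deque

current_plate = None  # module-level global, used the same way as in the question


class FakeRequest:
    """Hypothetical stand-in for scrapy.Request: a URL, a callback, and its kwargs."""

    def __init__(self, url, callback, cb_kwargs=None):
        self.url = url
        self.callback = callback
        self.cb_kwargs = cb_kwargs or {}


def start_requests(plates, callback):
    """Yield one request per plate, updating the global on every iteration."""
    global current_plate
    for plate in plates:
        current_plate = plate
        yield FakeRequest(
            f"https://example.invalid/search?q={plate}",  # placeholder URL
            callback,
            cb_kwargs={"plate_num_xlsx": plate},
        )


def run(plates):
    """Queue every request first, then fire the callbacks afterwards."""
    seen = []

    def parse(response, plate_num_xlsx=None):
        # Record what the global holds now versus what the request carried.
        seen.append((current_plate, plate_num_xlsx))

    # Phase 1: the generator is consumed and the requests are queued.
    queue = deque(start_requests(plates, parse))

    # Phase 2: "responses" arrive later and the callbacks run.
    while queue:
        request = queue.popleft()
        request.callback(f"response for {request.url}", **request.cb_kwargs)
    return seen


if __name__ == "__main__":
    for global_value, kwarg_value in run(["AB12CDE", "XY34ZZZ", "JK56LMN"]):
        print(f"global={global_value}  cb_kwargs={kwarg_value}")
```

In this simulation every callback sees the global set to the last plate, while the cb_kwargs value still matches the request that produced the response. That is the same mismatch described in the question.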