Dynamic start_urls list when crawling with Scrapy
import scrapy


class SomewebsiteProductSpider(scrapy.Spider):
    name = "somewebsite"
    allowed_domains = ["somewebsite.com"]
    start_urls = [
    ]

    def parse(self, response):
        items = somewebsiteItem()
        title = response.xpath('//h1[@id="title"]/span/text()').extract()
        sale_price = response.xpath('//span[contains(@id,"ourprice") or contains(@id,"saleprice")]/text()').extract()
        category = response.xpath('//a[@class="a-link-normal a-color-tertiary"]/text()').extract()
        availability = response.xpath('//div[@id="availability"]//text()').extract()
        items['product_name'] = ''.join(title).strip()
        items['product_sale_price'] = ''.join(sale_price).strip()
        items['product_category'] = ','.join(map(lambda x: x.strip(), category)).strip()
        items['product_availability'] = ''.join(availability).strip()
        # Every backslash in the Windows path must be escaped
        fo = open("C:\\Users\\user1\\PycharmProjects\\test.txt", "w")
        fo.write("%s \n%s \n%s" % (items['product_name'], items['product_sale_price'], self.start_urls))
        fo.close()
        print(items)
        yield items
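
test.py

from scrapy.crawler import CrawlerProcess
# SomewebsiteProductSpider must also be imported from the spider module (import not shown in the question)

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})
process.crawl(SomewebsiteProductSpider)
process.start()
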
How can I pass a dynamic start_urls list to the SomewebsiteProductSpider object from test.py before starting the crawl process? Any help would be appreciated. Thank you.
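
process.crawl accepts optional arguments that are passed on to the spider's constructor, so you can either populate start_urls from the spider's __init__ or use a custom start_requests method. For example: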

test.py

...
process.crawl(SomewebsiteProductSpider, url_list=[...])

somespider.py

class SomewebsiteProductSpider(scrapy.Spider):
    ...
    def __init__(self, *args, **kwargs):
        # Remove the custom url_list argument from kwargs before calling the base constructor
        self.start_urls = kwargs.pop('url_list', [])
        super(SomewebsiteProductSpider, self).__init__(*args, **kwargs)

By simply passing start_urls as an argument, you can avoid parsing the extra kwargs as in @mizghun's answer.

import scrapy
from scrapy.crawler import CrawlerProcess


class QuotesSpider(scrapy.Spider):
    name = 'quotes'

    def parse(self, response):
        print(response.url)


process = CrawlerProcess()
process.crawl(QuotesSpider, start_urls=["http://example.com", "http://example.org"])
process.start()