Create class instance variables in a Scrapy spider
I am new to Python. I want to create my own class instance variables variable_1 and variable_2 in a Scrapy spider class. The following code runs fine:
```python
import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

class SpiderTest1(scrapy.Spider):
    name = 'main run'
    url = 'url example'    # this class variable works fine
    variable_1 = 'info_1'  # this class variable works fine
    variable_2 = 'info_2'  # this class variable works fine

    def start_requests(self):
        urls = [self.url]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        print(f'some process with {self.variable_1}')
        print(f'some process with {self.variable_2}')

# start running the spider
process = CrawlerProcess(get_project_settings())
process.crawl(SpiderTest1())
process.start()
```
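The distinction at the heart of the question can be shown without Scrapy at all. A minimal pure-Python sketch (the `Spider` class here is an illustration, not Scrapy's): a class attribute is shared by all instances, while an attribute set in `__init__` belongs to one instance and shadows the class-level value.

```python
# Pure-Python illustration of class attributes vs. instance attributes.
class Spider:
    url = 'class-level default'  # shared by every instance

    def __init__(self, url=None):
        if url is not None:
            self.url = url       # instance attribute shadows the class one

a = Spider()              # falls back to the class attribute
b = Spider('per-run url') # gets its own instance attribute
print(a.url)  # → class-level default
print(b.url)  # → per-run url
```

This is why moving the values into `__init__` lets each run configure the spider differently without editing the class body.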
But I want to make them instance variables, so that I do not have to modify the values inside the spider every time I run it. I decided to add def __init__(self, url, variable_1, variable_2) to the spider, expecting to run it with SpiderTest1(url, variable_1, variable_2). Below is the new code, which I hoped would behave like the code above, but it does not work:
```python
import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

class SpiderTest1(scrapy.Spider):
    name = 'main run'

    # the following __init__ is the new change, but it is not working
    def __init__(self, url, variable_1, variable_2):
        self.url = url
        self.variable_1 = variable_1
        self.variable_2 = variable_2

    def start_requests(self):
        urls = [self.url]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        print(f'some process with {self.variable_1}')
        print(f'some process with {self.variable_2}')

# input values for the variables
url = 'url example'
variable_1 = 'info_1'
variable_2 = 'info_2'

# start running the spider
process = CrawlerProcess(get_project_settings())
process.crawl(SpiderTest1(url, variable_1, variable_2))  # this call does not work
process.start()
```
Result:

TypeError: __init__() missing 3 required positional arguments: 'url', 'variable_1', and 'variable_2'

Can anyone tell me how to achieve this? Thanks.
Following common practice and the API documentation, you should call the crawl method like this to pass arguments to the spider constructor. Note that crawl() expects the spider class (or its name), not an instance: Scrapy instantiates the spider itself and forwards any extra arguments to __init__, which is why passing a pre-built instance fails.

```python
process = CrawlerProcess(get_project_settings())
process.crawl(SpiderTest1, url, variable_1, variable_2)
process.start()
```
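The argument-forwarding behaviour can be sketched in plain Python. `FakeCrawlerProcess` below is a stand-in for illustration only (it is not the real Scrapy class): the point is that crawl() receives the *class* plus the extra arguments, and performs the instantiation itself.

```python
# Simplified stand-in (NOT real Scrapy) showing how crawl() forwards
# extra arguments to the spider constructor.
class FakeCrawlerProcess:
    def crawl(self, spidercls, *args, **kwargs):
        # crawl() gets the class, not an instance, and instantiates
        # it itself with whatever extra arguments were supplied.
        return spidercls(*args, **kwargs)

class SpiderTest1:
    name = 'main run'

    def __init__(self, url, variable_1, variable_2):
        self.url = url
        self.variable_1 = variable_1
        self.variable_2 = variable_2

process = FakeCrawlerProcess()
spider = process.crawl(SpiderTest1, 'url example', 'info_1', 'info_2')
print(spider.variable_1)  # → info_1
```

This also explains the TypeError above: when the framework does the instantiating, any arguments not routed through crawl() never reach __init__.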
Update: the documentation also mentions this form of running a spider:

```python
process.crawl('followall', domain='scrapinghub.com')
```

In this case, 'followall' is the name of a spider in the project (i.e. the value of the spider class's name attribute). In your particular case, where the spider is defined as follows:
```python
class SpiderTest1(scrapy.Spider):
    name = 'main run'
    ...
```
you would run the spider by its name with this code:
```python
process = CrawlerProcess(get_project_settings())
process.crawl('main run', url, variable_1, variable_2)
process.start()
```
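A simplified sketch of how the string form is resolved, assuming SpiderLoader-like behaviour: Scrapy indexes the project's spider classes by their `name` attribute, so the string is matched against `name`, not against the module's file name. (The registry function here is an illustration, not Scrapy's actual implementation.)

```python
# Hypothetical sketch of spider-name lookup: classes are indexed by
# their `name` class attribute, not by the .py file they live in.
class FollowAllSpider:
    name = 'followall'

class SpiderTest1:
    name = 'main run'

def build_registry(spider_classes):
    # Map each spider's `name` attribute to its class.
    return {cls.name: cls for cls in spider_classes}

registry = build_registry([FollowAllSpider, SpiderTest1])
print(registry['main run'].__name__)  # → SpiderTest1
```

So crawl('main run', ...) and crawl(SpiderTest1, ...) end up instantiating the same class; the string is just an indirection through the spider's `name`.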
Thanks, my code works correctly when done your way. But I noticed something slightly different from the common practice. This is our call:

```python
process.crawl(SpiderTest1, url, variable_1, variable_2)
```

while the conventional form is:

```python
process.crawl('followall', domain='scrapinghub.com')
```

The first form you suggested uses the class name SpiderTest1, but the other uses the string 'followall'. What does 'followall' refer to? Is it the file testspiders/testspiders/spiders/followall.py, or just the class attribute name = 'followall' inside followall.py? I ask because I am still confused about when to call a Scrapy spider by string and when by class name. Thanks.