Create class instance variables in a Scrapy spider
I am new to Python. I want to create my own class instance variables variable_1 and variable_2 in a Scrapy spider class. The following code runs fine:
```python
import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

class SpiderTest1(scrapy.Spider):
    name = 'main run'
    url = 'url example'    # this class variable works fine
    variable_1 = 'info_1'  # this class variable works fine
    variable_2 = 'info_2'  # this class variable works fine

    def start_requests(self):
        urls = [self.url]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        print(f'some process with {self.variable_1}')
        print(f'some process with {self.variable_2}')

# start running the spider
process = CrawlerProcess(get_project_settings())
process.crawl(SpiderTest1())
process.start()
```
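The distinction at the heart of the question can be shown without Scrapy at all. A minimal pure-Python sketch (the `Spider` class here is an illustration, not Scrapy's): a class attribute is shared by all instances, while an attribute set in `__init__` belongs to one instance and shadows the class-level value.

```python
# Pure-Python illustration of class attributes vs. instance attributes.
class Spider:
    url = 'class-level default'  # shared by every instance

    def __init__(self, url=None):
        if url is not None:
            self.url = url       # instance attribute shadows the class one

a = Spider()              # falls back to the class attribute
b = Spider('per-run url') # gets its own instance attribute
print(a.url)  # → class-level default
print(b.url)  # → per-run url
```

This is why moving the values into `__init__` lets each run configure the spider differently without editing the class body.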
But I want to make them instance variables, so that I do not have to modify the values inside the spider every time I run it. I decided to add def __init__(self, url, variable_1, variable_2) to the spider, expecting to run it with SpiderTest1(url, variable_1, variable_2). Below is the new code, which I hoped would behave like the code above, but it does not work:
```python
import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

class SpiderTest1(scrapy.Spider):
    name = 'main run'

    # the following __init__ is the new change, but it is not working
    def __init__(self, url, variable_1, variable_2):
        self.url = url
        self.variable_1 = variable_1
        self.variable_2 = variable_2

    def start_requests(self):
        urls = [self.url]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        print(f'some process with {self.variable_1}')
        print(f'some process with {self.variable_2}')

# input values for the variables
url = 'url example'
variable_1 = 'info_1'
variable_2 = 'info_2'

# start running the spider
process = CrawlerProcess(get_project_settings())
process.crawl(SpiderTest1(url, variable_1, variable_2))  # this call does not work
process.start()
```
Result:

TypeError: __init__() missing 3 required positional arguments: 'url', 'variable_1', and 'variable_2'

Can anyone tell me how to achieve this? Thanks.
Following common practice and the API documentation, you should call the crawl method like this to pass arguments to the spider constructor. Note that crawl() expects the spider class (or its name), not an instance: Scrapy instantiates the spider itself and forwards any extra arguments to __init__, which is why passing a pre-built instance fails.

```python
process = CrawlerProcess(get_project_settings())
process.crawl(SpiderTest1, url, variable_1, variable_2)
process.start()
```
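The argument-forwarding behaviour can be sketched in plain Python. `FakeCrawlerProcess` below is a stand-in for illustration only (it is not the real Scrapy class): the point is that crawl() receives the *class* plus the extra arguments, and performs the instantiation itself.

```python
# Simplified stand-in (NOT real Scrapy) showing how crawl() forwards
# extra arguments to the spider constructor.
class FakeCrawlerProcess:
    def crawl(self, spidercls, *args, **kwargs):
        # crawl() gets the class, not an instance, and instantiates
        # it itself with whatever extra arguments were supplied.
        return spidercls(*args, **kwargs)

class SpiderTest1:
    name = 'main run'

    def __init__(self, url, variable_1, variable_2):
        self.url = url
        self.variable_1 = variable_1
        self.variable_2 = variable_2

process = FakeCrawlerProcess()
spider = process.crawl(SpiderTest1, 'url example', 'info_1', 'info_2')
print(spider.variable_1)  # → info_1
```

This also explains the TypeError above: when the framework does the instantiating, any arguments not routed through crawl() never reach __init__.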
Update: the documentation also mentions this form of running a spider:

```python
process.crawl('followall', domain='scrapinghub.com')
```

In this case, 'followall' is the name of a spider in the project (i.e. the value of the spider class's name attribute). In your particular case, where the spider is defined as follows:
```python
class SpiderTest1(scrapy.Spider):
    name = 'main run'
    ...
```
you would run the spider by its name with this code:
```python
process = CrawlerProcess(get_project_settings())
process.crawl('main run', url, variable_1, variable_2)
process.start()
```
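A simplified sketch of how the string form is resolved, assuming SpiderLoader-like behaviour: Scrapy indexes the project's spider classes by their `name` attribute, so the string is matched against `name`, not against the module's file name. (The registry function here is an illustration, not Scrapy's actual implementation.)

```python
# Hypothetical sketch of spider-name lookup: classes are indexed by
# their `name` class attribute, not by the .py file they live in.
class FollowAllSpider:
    name = 'followall'

class SpiderTest1:
    name = 'main run'

def build_registry(spider_classes):
    # Map each spider's `name` attribute to its class.
    return {cls.name: cls for cls in spider_classes}

registry = build_registry([FollowAllSpider, SpiderTest1])
print(registry['main run'].__name__)  # → SpiderTest1
```

So crawl('main run', ...) and crawl(SpiderTest1, ...) end up instantiating the same class; the string is just an indirection through the spider's `name`.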
Thanks, my code works correctly when done your way. But I noticed something slightly different from the common practice. This is our call:

```python
process.crawl(SpiderTest1, url, variable_1, variable_2)
```

while the conventional form is:

```python
process.crawl('followall', domain='scrapinghub.com')
```

The first form you suggested uses the class name SpiderTest1, but the other uses the string 'followall'. What does 'followall' refer to? Is it the file testspiders/testspiders/spiders/followall.py, or just the class attribute name = 'followall' inside followall.py? I ask because I am still confused about when to call a Scrapy spider by string and when by class name. Thanks.