Create class instance variables in a Scrapy spider

I am new to Python. I want to give my Scrapy spider class its own instance variables, variable_1 and variable_2. The following code works:

import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings


class SpiderTest1(scrapy.Spider):

    name       = 'main run'
    url        = 'url example'  # this class variable works fine
    variable_1 = 'info_1'       # this class variable works fine
    variable_2 = 'info_2'       # this class variable works fine

    def start_requests(self):
        urls = [self.url]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        print(f'some process with {self.variable_1}')
        print(f'some process with {self.variable_2}')


# run the spider
process = CrawlerProcess(get_project_settings())
process.crawl(SpiderTest1())
process.start()

But I want these to be instance variables, so that I do not have to edit their values inside the spider every time I run it. I decided to add def __init__(self, url, variable_1, variable_2) to the spider, and I expected to run it with SpiderTest1(url, variable_1, variable_2). The following is the new code, which I expected to behave like the code above, but it does not work:

class SpiderTest1(scrapy.Spider):

    name = 'main run'

    # the following __init__ is the new change, but it does not work
    def __init__(self, url, variable_1, variable_2):
        self.url = url
        self.variable_1 = variable_1
        self.variable_2 = variable_2

    def start_requests(self):
        urls = [self.url]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        print(f'some process with {self.variable_1}')
        print(f'some process with {self.variable_2}')

# values to pass to the spider
url        = 'url example'
variable_1 = 'info_1'
variable_2 = 'info_2'

# run the spider
process = CrawlerProcess(get_project_settings())
process.crawl(SpiderTest1(url, variable_1, variable_2))  # this line does not seem to work
process.start()

This results in:

TypeError: __init__() missing 3 required positional arguments: 'url', 'variable_1', and 'variable_2'

Thanks to anyone who can tell me how to achieve this.

According to the Common Practices and API documentation, you should pass the spider class to the crawl method, followed by the arguments for the spider constructor:

process = CrawlerProcess(get_project_settings())   
process.crawl(SpiderTest1, url, variable_1, variable_2)
process.start()

UPDATE: The documentation also mentions this way of running a spider:

process.crawl('followall', domain='scrapinghub.com')

In this case, 'followall' is the name of the spider in the project (i.e. the value of the name attribute of the spider class). In your specific case, where you define the spider as follows:

class SpiderTest1(scrapy.Spider):
    name = 'main run'
    ...

you would use this code to run your spider by its name:

process = CrawlerProcess(get_project_settings())   
process.crawl('main run', url, variable_1, variable_2)
process.start()
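To make the difference between the two forms concrete, here is a toy model in plain Python. It is not Scrapy's implementation: SPIDER_REGISTRY, register, ToySpider and toy_crawl are all hypothetical names that only illustrate the idea that a string is first resolved to a class through its name attribute, after which the class is instantiated with the remaining arguments. In real Scrapy, as far as I know, the lookup by name relies on the project settings (SPIDER_MODULES), so the string form only works when the spider lives in a Scrapy project.

```python
# Toy model only: this is NOT Scrapy's code, just an illustration of the idea.
SPIDER_REGISTRY = {}  # hypothetical stand-in for the project's spider lookup


def register(cls):
    """Record a spider class under the value of its name attribute."""
    SPIDER_REGISTRY[cls.name] = cls
    return cls


@register
class ToySpider:
    name = 'main run'

    def __init__(self, url, variable_1, variable_2):
        self.url = url
        self.variable_1 = variable_1
        self.variable_2 = variable_2


def toy_crawl(spider, *args, **kwargs):
    """Accept a class or a name string, then build the spider from the class."""
    cls = SPIDER_REGISTRY[spider] if isinstance(spider, str) else spider
    return cls(*args, **kwargs)


by_class = toy_crawl(ToySpider, 'url example', 'info_1', 'info_2')
by_name = toy_crawl('main run', 'url example', 'info_1', 'info_2')
print(type(by_class) is type(by_name))
```

Either way the caller hands over a class (directly or by name) plus the constructor arguments, and never an already-built instance.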

Thanks, my code works with your approach. But I noticed something slightly different from Common Practices.

This is our code:

process.crawl(SpiderTest1, url, variable_1, variable_2)

This is from Common Practices:

process.crawl('followall', domain='scrapinghub.com')

The first argument in your suggestion is the class name SpiderTest1, but the other example uses the string 'followall'.

What does 'followall' refer to? Does it refer to the file testspiders/testspiders/spiders/followall.py, or just to the class variable name = 'followall' inside followall.py?

I am asking because I am still confused about when I should pass a string and when I should pass the class name to a Scrapy spider.

Thanks.
