
Create class instance variables in a Scrapy spider

I am new to Python. I want to create my own instance variables variable_1 and variable_2 in a Scrapy spider class. The following code works fine:

import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings


class SpiderTest1(scrapy.Spider):

    name       = 'main run'
    url        = 'url example'  # this class variable works fine
    variable_1 = 'info_1'       # this class variable works fine
    variable_2 = 'info_2'       # this class variable works fine

    def start_requests(self):
        urls = [self.url]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        print(f'some process with {self.variable_1}')
        print(f'some process with {self.variable_2}')


# start running the spider
process = CrawlerProcess(get_project_settings())
process.crawl(SpiderTest1())
process.start()

But I want to make them instance variables, so that I do not have to modify the values inside the spider every time I run it. I decided to add def __init__(self, url, variable_1, variable_2) to the spider, expecting to run it with SpiderTest1(url, variable_1, variable_2). The following is the new code, which I expected to behave like the code above, but it does not work:

import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings


class SpiderTest1(scrapy.Spider):

    name = 'main run'

    # the following __init__ is the new change, but it does not work
    def __init__(self, url, variable_1, variable_2):
        self.url        = url
        self.variable_1 = variable_1
        self.variable_2 = variable_2

    def start_requests(self):
        urls = [self.url]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        print(f'some process with {self.variable_1}')
        print(f'some process with {self.variable_2}')

# input values into variables
url        = 'url example'  
variable_1 = 'info_1'       
variable_2 = 'info_2' 

# start running the spider
process = CrawlerProcess(get_project_settings())
process.crawl(SpiderTest1(url, variable_1, variable_2))  # it seems this code doesn't work
process.start()

It results in:

TypeError: __init__() missing 3 required positional arguments: 'url', 'variable_1', and 'variable_2'

Thanks to anyone who can tell me how to achieve this.

According to the Common Practices and API documentation, you should call the crawl method like this to pass arguments to the spider constructor:

process = CrawlerProcess(get_project_settings())   
process.crawl(SpiderTest1, url, variable_1, variable_2)
process.start()
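Why this form works: process.crawl receives the spider class itself plus any extra arguments, and Scrapy instantiates the class internally, forwarding those arguments to __init__. A minimal, scrapy-free sketch of that forwarding (the Spider and Process classes below are stand-ins, not the real Scrapy API):

```python
class Spider:
    """Stand-in for a scrapy.Spider subclass with a custom __init__."""
    def __init__(self, url, variable_1, variable_2):
        self.url = url
        self.variable_1 = variable_1
        self.variable_2 = variable_2

class Process:
    """Stand-in showing how CrawlerProcess.crawl forwards arguments."""
    def crawl(self, spidercls, *args, **kwargs):
        # The *class* is passed in, not an instance; instantiation
        # happens inside, with the extra args forwarded to __init__.
        return spidercls(*args, **kwargs)

process = Process()
spider = process.crawl(Spider, 'url example', 'info_1', 'info_2')
print(spider.variable_1)  # → info_1
```

This is why process.crawl(SpiderTest1(url, ...)) fails: crawl expects the class, not an already-built instance.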

UPDATE: The documentation also mentions this form of running the spider:

process.crawl('followall', domain='scrapinghub.com')

In this case, 'followall' is the name of the spider in the project (i.e. the value of the name attribute of the spider class). In your specific case, where you define the spider as follows:

class SpiderTest1(scrapy.Spider):
    name = 'main run'
    ...

you would use this code to run your spider using spider name:

process = CrawlerProcess(get_project_settings())   
process.crawl('main run', url, variable_1, variable_2)
process.start()
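The string form works because Scrapy's spider loader scans the project's spider modules and indexes each spider class by its name attribute, so both the class and its name resolve to the same thing. A minimal, scrapy-free sketch of that name-to-class resolution (the registry dict here is hypothetical; the real lookup is done by Scrapy's SpiderLoader):

```python
class SpiderTest1:
    """Stand-in spider class; only the `name` attribute matters here."""
    name = 'main run'
    def __init__(self, url, variable_1, variable_2):
        self.url = url
        self.variable_1 = variable_1
        self.variable_2 = variable_2

# Hypothetical registry: the real one is built by Scrapy's SpiderLoader,
# which maps each spider's `name` attribute to its class.
registry = {cls.name: cls for cls in [SpiderTest1]}

def crawl(spider_name_or_cls, *args):
    """Accept either a spider name string or the class itself."""
    if isinstance(spider_name_or_cls, str):
        cls = registry[spider_name_or_cls]
    else:
        cls = spider_name_or_cls
    return cls(*args)

# Both forms resolve to the same class:
a = crawl('main run', 'url example', 'info_1', 'info_2')
b = crawl(SpiderTest1, 'url example', 'info_1', 'info_2')
print(type(a) is type(b))  # → True
```

So the string refers to the name attribute, not to the module filename.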

Thanks, my code works fine with your approach. But I find things slightly different from Common Practices.

This is our code:

process.crawl(SpiderTest1, url, variable_1, variable_2)

This is from Common Practices:

process.crawl('followall', domain='scrapinghub.com')

The first argument, as you suggest, is the class name SpiderTest1, but the other one uses the string 'followall'.

What does 'followall' refer to? Does it refer to the file testspiders/testspiders/spiders/followall.py, or just to the class attribute name = 'followall' inside followall.py?

I am asking because I am still confused about when I should pass a string and when a class name to run a Scrapy spider.

Thanks.
