I am new to Python. I want to create my own instance variables, variable_1 and variable_2, in a Scrapy spider class. The following code works fine:
import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

class SpiderTest1(scrapy.Spider):
    name = 'main run'
    url = 'url example'    # this class variable works fine
    variable_1 = 'info_1'  # this class variable works fine
    variable_2 = 'info_2'  # this class variable works fine

    def start_requests(self):
        urls = [self.url]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        print(f'some process with {self.variable_1}')
        print(f'some process with {self.variable_2}')

# start running the spider
process = CrawlerProcess(get_project_settings())
process.crawl(SpiderTest1())
process.start()
But I want to make them instance variables, so that I do not have to modify the variables' values inside the spider every time I run it. I decided to add def __init__(self, url, variable_1, variable_2) to the spider, and I expected to run it with SpiderTest1(url, variable_1, variable_2). The following is the new code, which I expected to behave the same as the code above, but it does not work:
import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

class SpiderTest1(scrapy.Spider):
    name = 'main run'

    # the following __init__ is the new change, but it is not working
    def __init__(self, url, variable_1, variable_2):
        self.url = url
        self.variable_1 = variable_1
        self.variable_2 = variable_2

    def start_requests(self):
        urls = [self.url]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        print(f'some process with {self.variable_1}')
        print(f'some process with {self.variable_2}')

# input values for the variables
url = 'url example'
variable_1 = 'info_1'
variable_2 = 'info_2'

# start running the spider
process = CrawlerProcess(get_project_settings())
process.crawl(SpiderTest1(url, variable_1, variable_2))  # this line does not work
process.start()
It results in:
TypeError: __init__() missing 3 required positional arguments: 'url', 'variable_1', and 'variable_2'
Thanks to anyone who can tell me how to achieve this.
According to the Common Practices and API documentation, you should call the crawl method like this to pass arguments to the spider constructor:
process = CrawlerProcess(get_project_settings())
process.crawl(SpiderTest1, url, variable_1, variable_2)
process.start()
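The reason passing an already-constructed instance fails is that Scrapy instantiates the spider class itself, forwarding whatever extra arguments you give to crawl. The mechanism can be sketched without Scrapy at all; note that FakeSpider and the crawl function below are illustrative stand-ins, not real Scrapy APIs:

```python
# Simplified, Scrapy-free sketch of how CrawlerProcess.crawl forwards
# extra arguments to the spider constructor. FakeSpider and crawl are
# illustrative stand-ins, not real Scrapy classes.
class FakeSpider:
    def __init__(self, url, variable_1, variable_2):
        self.url = url
        self.variable_1 = variable_1
        self.variable_2 = variable_2

def crawl(spider_cls, *args, **kwargs):
    # Scrapy receives the *class* and builds the instance itself,
    # which is why you pass the class plus the constructor arguments,
    # never a pre-built instance.
    return spider_cls(*args, **kwargs)

spider = crawl(FakeSpider, 'url example', 'info_1', 'info_2')
print(spider.variable_1)  # → info_1
```

Because Scrapy performs the instantiation internally, passing SpiderTest1(...) hands it an object where it expects a class, and it later tries to build a fresh spider with no arguments, producing the TypeError you saw.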
UPDATE: The documentation also mentions this form of running the spider:
process.crawl('followall', domain='scrapinghub.com')
In this case, 'followall' is the name of a spider in the project (i.e. the value of the name attribute of the spider class). In your specific case, where you define the spider as follows:
class SpiderTest1(scrapy.Spider):
    name = 'main run'
    ...
you would use this code to run your spider using spider name:
process = CrawlerProcess(get_project_settings())
process.crawl('main run', url, variable_1, variable_2)
process.start()
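Under the hood, the string form works because Scrapy maps each spider's name attribute to its class. A simplified, Scrapy-free illustration of that registry idea (in real Scrapy the lookup is done by the spider loader, which scans the modules listed in the SPIDER_MODULES setting):

```python
# Simplified illustration of how a spider name maps to a spider class.
# In real Scrapy this lookup is performed by the spider loader, which
# scans the modules configured in SPIDER_MODULES.
class SpiderTest1:
    name = 'main run'

class FollowAllSpider:
    name = 'followall'

# Build a name -> class registry, as the spider loader does
registry = {cls.name: cls for cls in (SpiderTest1, FollowAllSpider)}

spider_cls = registry['main run']  # resolves the string to SpiderTest1
```

So both forms end up at the same place: the string is first resolved to the spider class, and from there crawl proceeds exactly as if you had passed the class directly.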
Thanks, my code works fine with your approach. But I find things slightly different from Common Practices.
This is our code:
process.crawl(SpiderTest1, url, variable_1, variable_2)
This is from Common Practices:
process.crawl('followall', domain='scrapinghub.com')
The first argument, as you suggest, is the class name SpiderTest1, but the other one uses the string 'followall'. What does 'followall' refer to? Does it refer to the file testspiders/testspiders/spiders/followall.py, or just to the class attribute name = 'followall' inside followall.py? I am asking because I am still confused about when I should pass a string and when I should pass a class name to run a Scrapy spider.
Thanks.