Create class instance variables in a scrapy spider
I'm new to Python. I want to create my own class instance variables variable_1 and variable_2 in a scrapy spider class. The following code works fine:
```python
import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

class SpiderTest1(scrapy.Spider):
    name = 'main run'
    url = 'url example'    # this class variable works fine
    variable_1 = 'info_1'  # this class variable works fine
    variable_2 = 'info_2'  # this class variable works fine

    def start_requests(self):
        urls = [self.url]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        print(f'some process with {self.variable_1}')
        print(f'some process with {self.variable_2}')

# start running the spider
process = CrawlerProcess(get_project_settings())
process.crawl(SpiderTest1())
process.start()
```
But I want to make them instance variables, so that I don't have to modify their values inside the spider every time I run it. I decided to add `def __init__(self, url, variable_1, variable_2)` to the scrapy spider, hoping to run it with `SpiderTest1(url, variable_1, variable_2)`. Below is the new code, which I expected to behave like the code above, but it doesn't work:
```python
class SpiderTest1(scrapy.Spider):
    name = 'main run'

    # the following __init__ is the new change, but it does not work
    def __init__(self, url, variable_1, variable_2):
        self.url = url
        self.variable_1 = variable_1
        self.variable_2 = variable_2

    def start_requests(self):
        urls = [self.url]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        print(f'some process with {self.variable_1}')
        print(f'some process with {self.variable_2}')

# assign values to the variables
url = 'url example'
variable_1 = 'info_1'
variable_2 = 'info_2'

# start running the spider
process = CrawlerProcess(get_project_settings())
process.crawl(SpiderTest1(url, variable_1, variable_2))  # it seems this call doesn't work
process.start()
```
Result:

```
TypeError: __init__() missing 3 required positional arguments: 'url', 'variable_1', and 'variable_2'
```

Thanks to anyone who can tell me how to achieve this.
According to common practice and the API documentation, you should call the `crawl` method like this to pass arguments to the spider constructor:
```python
process = CrawlerProcess(get_project_settings())
process.crawl(SpiderTest1, url, variable_1, variable_2)
process.start()
```
Update: the documentation also mentions this form of running a spider:

```python
process.crawl('followall', domain='scrapinghub.com')
```

In this case, `'followall'` is the name of a spider in the project (that is, the value of the spider class's `name` attribute). In your particular case, where the spider is defined as follows:
```python
class SpiderTest1(scrapy.Spider):
    name = 'main run'
    ...
```
you would run the spider by its name with the following code:
```python
process = CrawlerProcess(get_project_settings())
process.crawl('main run', url, variable_1, variable_2)
process.start()
```
Thanks, my code works fine your way. But I found that things differ slightly from the common practice. This is our code:

```python
process.crawl(SpiderTest1, url, variable_1, variable_2)
```

and this is the convention:

```python
process.crawl('followall', domain='scrapinghub.com')
```

The first one you suggested uses the class name `SpiderTest1`, but the other uses the string `'followall'`. What does `'followall'` refer to? Does it refer to the file `testspiders/testspiders/spiders/followall.py`, or just the class attribute `name = 'followall'` inside `followall.py`? I'm asking because I'm still confused about when to call a scrapy spider by string and when by class name. Thanks.