How to pass an argument into a Scrapy spider and init it from within Python
I am trying to pass a variable screen_name to my spider, because this screen_name will change every time. (The end goal is to have multiple spiders running with different screen_names.)

I initialise like this:
process.crawl(TwitterSpider(screen_name="realDonaldTrump"))
However, I get the following error:

    spider = cls(*args, **kwargs)
    TypeError: __init__() missing 1 required positional argument: 'screen_name'
    import scrapy
    from scrapy.crawler import CrawlerProcess

    class TwitterSpider(scrapy.Spider):
        name = "twitter_friends"

        def __init__(self, screen_name, *args, **kwargs):
            self.usernames = []
            self.screen_name = screen_name
            super().__init__(**kwargs)

        def start_requests(self):
            base_url = "https://mobile.twitter.com"
            urls = [
                base_url + '/{screen_name}/following'.format(screen_name=self.screen_name),
            ]
            for url in urls:
                yield scrapy.Request(url=url, callback=self.parse)

        def closed(self, spider):
            print("spider closed")

        def parse(self, response):
            pass

    process = CrawlerProcess({
        'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
    })

    process.crawl(TwitterSpider(screen_name="realDonaldTrump"))
    process.start()  # the script will block here until the crawling is finished
This is not a question about how to run it from the command line, but only from within Python.
You can pass the spider class and its arguments to the crawl method. E.g.:

    process.crawl(TwitterSpider, screen_name="realDonaldTrump")
    process.start()
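The traceback explains why: crawl expects the spider class plus keyword arguments, and Scrapy instantiates the spider itself later via the `spider = cls(*args, **kwargs)` line shown in the error. When you pass an already-built instance, your kwargs never reach that internal call, so `screen_name` is missing. A minimal sketch of the mechanism (no Scrapy needed; the `crawl` function below is a rough stand-in for what Scrapy does internally, not its real implementation):

```python
class TwitterSpider:
    """Pared-down stand-in for the scrapy.Spider subclass in the question."""
    def __init__(self, screen_name, *args, **kwargs):
        self.usernames = []
        self.screen_name = screen_name

def crawl(spider_cls, *args, **kwargs):
    # Scrapy does (roughly) this internally: it builds the spider itself
    # from the class you hand it. If you pass an instance instead of the
    # class, this construction runs again *without* your kwargs and raises
    # the "missing 1 required positional argument" TypeError.
    return spider_cls(*args, **kwargs)

# Pass the class and the kwargs separately, as in the accepted answer:
spider = crawl(TwitterSpider, screen_name="realDonaldTrump")
print(spider.screen_name)  # realDonaldTrump
```

This is also why the end goal of running several spiders with different screen_names works with repeated `process.crawl(TwitterSpider, screen_name=...)` calls before a single `process.start()`: each call gets its own kwargs for its own spider instance.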