How to pass an argument into a Scrapy spider and initialize it from within Python
I am trying to pass a variable screen_name to my spider, because this screen_name changes every time. (The end goal is to have multiple spiders running with different screen_names.)
I initialize it like this:
process.crawl(TwitterSpider(screen_name="realDonaldTrump"))
But I get the following error:

spider = cls(*args, **kwargs)
TypeError: __init__() missing 1 required positional argument: 'screen_name'
import scrapy
from scrapy.crawler import CrawlerProcess

class TwitterSpider(scrapy.Spider):
    name = "twitter_friends"

    def __init__(self, screen_name, *args, **kwargs):
        self.usernames = []
        self.screen_name = screen_name
        super().__init__(**kwargs)

    def start_requests(self):
        base_url = "https://mobile.twitter.com"
        urls = [
            base_url + '/{screen_name}/following'.format(screen_name=self.screen_name),
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def closed(self, spider):
        print("spider closed")

    def parse(self, response):
        pass

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})
process.crawl(TwitterSpider(screen_name="realDonaldTrump"))
process.start()  # the script will block here until the crawling is finished
This is not a question about how to run it from the command line; it is only about running it from within Python.
You can pass the spider class and its arguments to the crawl method. For example:
process.crawl(TwitterSpider, screen_name="realDonaldTrump")
process.start()
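The reason this works is that Scrapy instantiates the spider itself: crawl receives the class plus keyword arguments, and internally the spider is created roughly as spider = cls(*args, **kwargs). A minimal sketch of that mechanism, using a hypothetical FakeSpider and crawl stand-in (no Scrapy required) to show why passing an already-constructed instance fails but passing the class with kwargs succeeds:

```python
class FakeSpider:
    """Hypothetical stand-in for a scrapy.Spider subclass."""
    def __init__(self, screen_name, *args, **kwargs):
        self.screen_name = screen_name

def crawl(spidercls, *args, **kwargs):
    # Simplified model of CrawlerProcess.crawl: it takes the spider
    # CLASS and constructs the instance internally from the kwargs.
    return spidercls(*args, **kwargs)

# Correct usage: pass the class and the keyword arguments separately.
spider = crawl(FakeSpider, screen_name="realDonaldTrump")
print(spider.screen_name)  # realDonaldTrump
```

For the stated end goal of running several spiders with different screen_names, CrawlerProcess also supports calling process.crawl(TwitterSpider, screen_name=...) once per name before a single process.start().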