I am trying to pass a variable `screen_name` to my spider, because this screen_name will change every time (the end goal is to have multiple spiders running with different screen_names).
I initialise it like this:

process.crawl(TwitterSpider(screen_name="realDonaldTrump"))

However, I get the following error:

    spider = cls(*args, **kwargs)
    TypeError: __init__() missing 1 required positional argument: 'screen_name'
import scrapy
from scrapy.crawler import CrawlerProcess


class TwitterSpider(scrapy.Spider):
    name = "twitter_friends"

    def __init__(self, screen_name, *args, **kwargs):
        self.usernames = []
        self.screen_name = screen_name
        super().__init__(**kwargs)

    def start_requests(self):
        base_url = "https://mobile.twitter.com"
        urls = [
            base_url + '/{screen_name}/following'.format(screen_name=self.screen_name),
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def closed(self, spider):
        print("spider closed")

    def parse(self, response):
        pass


process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

process.crawl(TwitterSpider(screen_name="realDonaldTrump"))
process.start()  # the script will block here until the crawling is finished
This is not a question about how to run it from the command line, but only from within Python.
You can pass the spider class and its arguments to the `crawl` method, e.g.:
process.crawl(TwitterSpider, screen_name="realDonaldTrump")
process.start()
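To see why the original call fails, here is a minimal, self-contained sketch of the mechanism. The `Spider` and `crawl` definitions below are simplified stand-ins for illustration, not Scrapy's real signatures (the real `from_crawler` also receives a `Crawler` object); the point is that Scrapy builds the spider itself from the class, so the keyword arguments used to pre-build an instance are never forwarded:

```python
class Spider:
    """Simplified stand-in for scrapy.Spider (illustration only)."""

    def __init__(self, **kwargs):
        pass

    @classmethod
    def from_crawler(cls, *args, **kwargs):
        # This is the line from the traceback: the spider is constructed
        # from the class, using only the arguments given to crawl().
        return cls(*args, **kwargs)


class TwitterSpider(Spider):
    def __init__(self, screen_name, *args, **kwargs):
        self.screen_name = screen_name
        super().__init__(**kwargs)


def crawl(spidercls, *args, **kwargs):
    """Simplified stand-in for CrawlerProcess.crawl()."""
    # from_crawler is a classmethod, so even if spidercls is an already
    # built instance, the lookup resolves to its class, and the kwargs
    # used to build that instance are lost.
    return spidercls.from_crawler(*args, **kwargs)


# Passing the class plus keyword arguments works:
spider = crawl(TwitterSpider, screen_name="realDonaldTrump")
print(spider.screen_name)  # realDonaldTrump

# Passing a ready-made instance re-instantiates the class with no
# arguments, reproducing the TypeError from the question:
try:
    crawl(TwitterSpider(screen_name="realDonaldTrump"))
except TypeError as err:
    print(err)
```

For the stated end goal of several spiders with different screen_names, the same pattern applies: call `process.crawl(TwitterSpider, screen_name=...)` once per screen name before the single `process.start()`.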