def start_requests(self):
    db = SeedUserGenerator()
    result = db.selectSeedUsers()
    db.closeDB()
    urls = []
    for name in result:
        urls.append(self.user_info_url.format(name))
    for url in urls:
        yield Request(url=url, callback=self.parse_user, dont_filter=False, priority=10)
    print('fin')
def parse_user(self, response):
    # ...ignore some code here...
    yield Request(url=next_url, priority=20, callback=self.parse_info)

def parse_info(self, response):
    # ...ignore some code here...
    yield Request(url=next_url, priority=30, callback=self.parse_user)
The program runs as follows: start_requests begins yielding, then seems to pause without printing the string 'fin'. parse_user yields another Request, but the remaining Requests in start_requests cannot be yielded until that response has been processed, so the yield operations form a ring. It seems synchronous: until a Request from start_requests has been sent and its response processed, no other Requests can be yielded.

Does that mean Scrapy can never yield the remaining Requests in start_requests? How can I make Scrapy finish running start_requests first? I'm new to Python and Scrapy. Can Scrapy process a response and yield Requests at the same time?
By the way, I'm using Python 3.6, Scrapy 1.5.1 and Twisted 20.3.0.
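The pausing behaviour can be reproduced without Scrapy at all; it is just how Python generators work. Below is a minimal sketch (the names and log entries are made up for illustration) showing that code after the last yield, like the print('fin') above, only runs once the generator is fully exhausted, and the consumer is free to do other work between pulls:

```python
# Minimal sketch of lazy generator consumption (no Scrapy involved).
log = []

def start_requests():
    for i in range(3):
        yield f"request-{i}"
    log.append("fin")  # runs only once the generator is exhausted

gen = start_requests()
log.append(next(gen))               # the engine pulls exactly one request...
log.append("processed a response")  # ...and may process responses in between
log.extend(gen)                     # 'fin' appears only after the last pull
print(log)
```

Nothing forces the consumer to drain the generator in one go, which is exactly the freedom Scrapy's engine exploits.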
I solved my problem by referring to the source code of the Scrapy engine:
def _next_request(self, spider):
    slot = self.slot
    if not slot:
        return
    if self.paused:
        return
    while not self._needs_backout(spider):
        if not self._next_request_from_scheduler(spider):
            break
    if slot.start_requests and not self._needs_backout(spider):
        try:
            request = next(slot.start_requests)
        except StopIteration:
            slot.start_requests = None
        except Exception:
            slot.start_requests = None
            logger.error('Error while obtaining start requests',
                         exc_info=True, extra={'spider': spider})
        else:
            self.crawl(request, spider)
    if self.spider_is_idle(spider) and slot.close_if_idle:
        self._spider_idle(spider)
Here Scrapy always tries to get requests from the scheduler's queues first, and only then pulls a single request from start_requests. Moreover, Scrapy never enqueues all of start_requests up front; it consumes the generator one request per engine loop.
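The scheduling order this engine code implies can be sketched as a toy loop in plain Python (a simplified model, not Scrapy's actual implementation: it ignores priorities, concurrency and backout; all names except the comments are invented):

```python
from collections import deque

def run_engine(start_requests, handle):
    """Toy model of the loop above: drain the scheduler queue first,
    then pull at most one item from the start_requests generator."""
    scheduler = deque()
    gen = iter(start_requests)
    order = []
    while True:
        while scheduler:                    # _next_request_from_scheduler
            req = scheduler.popleft()
            order.append(req)
            scheduler.extend(handle(req))   # a callback may yield new requests
        try:
            req = next(gen)                 # next(slot.start_requests)
        except StopIteration:
            break                           # slot.start_requests = None
        scheduler.append(req)               # self.crawl(request, spider)
    return order

# Each seed's callback yields one follow-up, like parse_user -> parse_info.
def handle(req):
    return [req.replace("seed", "follow")] if req.startswith("seed") else []

print(run_engine(["seed-0", "seed-1"], handle))
# seed-0's follow-up is processed before seed-1 is even pulled from start_requests
```

In this model the follow-ups generated by the first seed keep the scheduler busy, so the second seed only leaves start_requests once the queue is empty again, matching the interleaving observed in the question.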
So, I changed my code like this:
def start_requests(self):
    db = SeedUserGenerator()
    result = db.selectSeedUsers()
    db.closeDB()
    urls = []
    for name in result:
        urls.append(self.user_info_url.format(name))
    yield Request(url=urls[0], callback=self.parse_temp, dont_filter=True, priority=10, meta={'urls': urls})

def parse_temp(self, response):
    urls = response.meta['urls']
    for url in urls:
        print(url)
        yield Request(url=url, callback=self.parse_user, dont_filter=False, priority=10)
    print('fin2')
Then Scrapy puts all the requests into the queues first.
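Why the workaround helps can be seen with a toy FIFO model (hypothetical names, priorities ignored, assuming everything a callback yields is enqueued before the next request is dequeued): because parse_temp yields every seed URL from a single callback, they all enter the queue together, ahead of their follow-ups.

```python
from collections import deque

def run(bootstrap, callbacks):
    """Toy FIFO model: everything a callback yields is enqueued
    before the next request is dequeued."""
    scheduler = deque([bootstrap])
    order = []
    while scheduler:
        req = scheduler.popleft()
        order.append(req)
        scheduler.extend(callbacks.get(req, []))
    return order

callbacks = {
    "bootstrap": ["user-0", "user-1"],  # parse_temp yields every seed URL
    "user-0": ["info-0"],               # parse_user yields a parse_info request
    "user-1": ["info-1"],
}
print(run("bootstrap", callbacks))
# every seed request is dequeued before any of their follow-ups
```

Contrast this with the original spider, where each seed left start_requests one at a time and could be overtaken by follow-ups already sitting in the queue.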