
Info on Scrapy CONCURRENT_REQUESTS in Python

I'm using Scrapy and read about the CONCURRENT_REQUESTS setting in the docs. They describe it as "The maximum number of concurrent (ie. simultaneous) requests that will be performed by the Scrapy downloader."

I created a spider to collect questions and answers from Q&A websites, so I want to know whether it is possible to run multiple concurrent requests. For now I have set this value to 1, because I don't want to lose any items or overwrite existing ones. My main doubt is that I keep a global ID, idQuestion (used to build idQuestion.idAnswer), for every item, so I don't know whether making multiple requests at once could turn into a mess and lose some items or assign wrong IDs.
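For reference, this is roughly what the setting looks like in a project's settings.py. It is a minimal fragment, not a full configuration; the value 1 is the one described above, and the per-domain line is only there to show Scrapy's related limit.

```python
# settings.py (fragment)

# Maximum number of requests the Scrapy downloader performs at the same time.
# Scrapy's default is 16; 1 forces strictly sequential downloads.
CONCURRENT_REQUESTS = 1

# Related limit applied per domain (Scrapy's default is 8).
CONCURRENT_REQUESTS_PER_DOMAIN = 8
```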

This is a snippet of code:

import scrapy


class Scraper(scrapy.Spider):
    uid = 1  # global question ID, shared by every call to parse_page

    def parse_page(self, response):
        # Scrape a single question.
        item = ScrapeItem()
        hxs = HtmlXPathSelector(response)
        #item['date_time'] = response.meta['data']
        item['type'] = "Question"
        item['uid'] = str(self.uid)
        item['url'] = response.url

        # Do some scraping (ans_uid and composed_string are set in the omitted code).
        ans_uid = ans_uid + 1
        item['uid'] = str(self.uid) + ":" + str(ans_uid)
        yield item

        # Recursively call this method on the next page.
        print("NEXT -> " + str(composed_string))
        yield scrapy.Request(composed_string, callback=self.parse_page)

This is the skeleton of my code. I use uid to store the ID of each question and ans_uid for each answer. For example:

1) Question

1.1) Ans 1 for Question 1

1.2) Ans 2 for Question 1

1.3) Ans 3 for Question 1

**Can I simply increase the CONCURRENT_REQUESTS value without compromising anything?**

The answer to your question is: no. If you increase the number of concurrent requests, you can end up with different uid values for pages that belong to the same question, because there is no guarantee that your requests are handled in order.
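To make the ordering problem concrete, here is a small simulation. It is only a sketch: it uses asyncio from the standard library as a stand-in for Scrapy's downloader, and the page names and delays are made up for illustration.

```python
import asyncio

# Made-up download times in seconds: the first page of question 1 is the slowest.
DELAYS = {"q1-page1": 0.05, "q2-page1": 0.01, "q1-page2": 0.03}


async def fetch_and_parse(page, state, items):
    await asyncio.sleep(DELAYS[page])  # stand-in for the download
    # Naive callback: takes whatever the shared counter happens to be on arrival.
    state["uid"] += 1
    items.append((page, state["uid"]))


async def crawl():
    state = {"uid": 0}
    items = []
    # All three "requests" are in flight at the same time.
    await asyncio.gather(*(fetch_and_parse(page, state, items) for page in DELAYS))
    return items


items = asyncio.run(crawl())
print(items)
```

The pages complete in order of their delays rather than in the order they were requested, so the two pages of question 1 receive different IDs and question 2 ends up numbered before question 1.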

However, you can pass information along with your Request objects through the meta attribute. I would put the ID into meta when yielding the Request(...) and then check in parse_page whether it is present. If it is not, this is a new question; if it is, reuse that ID, because the page belongs to a question you have already started.

You can read more about meta here: http://doc.scrapy.org/en/latest/topics/request-response.html#scrapy.http.Request.meta

Scrapy is not a multithreaded environment; it uses an event-loop-driven asynchronous architecture (Twisted, which is a bit like Node.js for Python).

In that sense, it is completely thread safe: your callbacks run one at a time on a single thread.
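As a rough illustration of what that means for shared state, the sketch below uses asyncio from the standard library as a stand-in for Twisted's event loop (an analogy, not Scrapy's actual machinery). Ten callbacks keep yielding to the loop while updating one counter, and no update is lost, because nothing can interrupt the increment itself.

```python
import asyncio


class SharedState:
    counter = 0


async def callback(state, times):
    for _ in range(times):
        await asyncio.sleep(0)  # yield to the event loop, as waiting on I/O would
        state.counter += 1      # no await inside the update, so it cannot be interrupted


async def main():
    state = SharedState()
    # Ten callbacks interleave on a single thread.
    await asyncio.gather(*(callback(state, 1000) for _ in range(10)))
    return state.counter


total = asyncio.run(main())
print(total)
```

This only shows that updates are not lost or torn. It says nothing about the order in which the callbacks run, which is still not guaranteed.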

Each response also carries a reference to the request that produced it, response.request. That gives you response.request.url, the Referer header that was sent, and response.request.meta, so a mapping from answers back to questions is built in (a Referer header of sorts). If you read a list of questions or answers from a single page, you are guaranteed that those questions and answers will be read in order.

You can do something like the following:

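import scrapy


class Answer(scrapy.Item):
    answer = scrapy.Field()
    question_url = scrapy.Field()


class mySpider(scrapy.Spider):
    def parse_answer(self, response):
        # The Referer header holds the URL of the page that led to this request.
        question_url = response.request.headers.get('Referer', None)
        yield Answer(question_url=..., answer=...)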

Hope that helps.
