
How can I loop in the return statement so that I do not repeat scrapy.Request?

I am scraping a website. I tried to build the return value with a loop, but it did not work: I got the result for only the first link. I want a loop that returns all three requests.

import scrapy


class SiteFetching(scrapy.Spider):
    name = 'Site'

    def start_requests(self):
        links = {'transcription_page': 'https://www.rev.com/freelancers/transcription',
                 'captions_page': 'https://www.rev.com/freelancers/captions',
                 'subtitles_page': 'https://www.rev.com/freelancers/subtitles'}
        call = [self.parse_transcription, self.parse_caption, self.parse_subtitles]

        return [
            scrapy.Request(links['transcription_page'], callback=call[0]),
            scrapy.Request(links['captions_page'], callback=call[1]),
            scrapy.Request(links['subtitles_page'], callback=call[2])
        ]

Yes. A list comprehension can do the looping, so that the text scrapy.Request() appears only once in the program. It is still a loop, of course, so scrapy.Request() is called once per iteration:

import scrapy


class SiteFetching(scrapy.Spider):
    name = 'Site'

    def start_requests(self):
        # Each entry pairs a URL with the callback that should parse its response.
        links = [('https://www.rev.com/freelancers/transcription', self.parse_transcription),
                 ('https://www.rev.com/freelancers/captions', self.parse_caption),
                 ('https://www.rev.com/freelancers/subtitles', self.parse_subtitles)]

        return [scrapy.Request(url, callback=callback) for url, callback in links]
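If you would rather keep the links dict and the call list from the question, a sketch like the following pairs them up with zip(). The Request class here is only a hypothetical stand-in for scrapy.Request so that the snippet runs without Scrapy installed, and the module-level parse_* functions and build_requests() are placeholders for the spider's methods. In the real spider you would use scrapy.Request and self.parse_... instead.

```python
from typing import Callable, NamedTuple


class Request(NamedTuple):
    """Hypothetical stand-in for scrapy.Request (assumption, not the real class)."""
    url: str
    callback: Callable


# Placeholder callbacks; in the spider these would be methods on the class.
def parse_transcription(response): ...
def parse_caption(response): ...
def parse_subtitles(response): ...


def build_requests():
    links = {'transcription_page': 'https://www.rev.com/freelancers/transcription',
             'captions_page': 'https://www.rev.com/freelancers/captions',
             'subtitles_page': 'https://www.rev.com/freelancers/subtitles'}
    call = [parse_transcription, parse_caption, parse_subtitles]

    # Dicts keep insertion order (Python 3.7+), so links.values() lines up
    # with the callbacks in `call` position by position.
    return [Request(url, callback=cb) for url, cb in zip(links.values(), call)]
```

This relies on the dict and the list being written in the same order, which is why a single list of (url, callback) pairs is arguably harder to get wrong.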

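Another option is a generator expression, which builds each Request object only when the consumer asks for it instead of building all three up front. Constructing a Request does not itself send anything, so this changes when the objects are created, not when the pages are downloaded: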

        return (scrapy.Request(url, callback=callback) for url, callback in links)
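In Scrapy spiders the same lazy behaviour is often written as a generator function with yield rather than as a generator expression. Here is a minimal sketch of that form. It assumes a stand-in Request class and a plain SiteFetching class (not a real scrapy.Spider subclass) so that it runs without Scrapy installed, and the parse_* bodies are empty placeholders.

```python
from typing import Callable, Iterator, NamedTuple


class Request(NamedTuple):
    """Hypothetical stand-in for scrapy.Request (assumption, not the real class)."""
    url: str
    callback: Callable


class SiteFetching:
    """Plain class standing in for a scrapy.Spider subclass."""
    name = 'Site'

    def start_requests(self) -> Iterator[Request]:
        links = [('https://www.rev.com/freelancers/transcription', self.parse_transcription),
                 ('https://www.rev.com/freelancers/captions', self.parse_caption),
                 ('https://www.rev.com/freelancers/subtitles', self.parse_subtitles)]
        # Because of `yield`, calling start_requests() returns a generator;
        # each Request is built only when the consumer asks for the next item.
        for url, callback in links:
            yield Request(url, callback=callback)

    # Placeholder callbacks; real ones would extract data from the response.
    def parse_transcription(self, response): ...
    def parse_caption(self, response): ...
    def parse_subtitles(self, response): ...
```

The for loop leaves room for extra per-link logic (logging, skipping a URL, adding headers) that would be awkward to fit inside a single expression.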

By the way, I know nothing about Scrapy's Spider class, so treat the rest as general Python.

With the generator version, calling start_requests() only returns a generator; each Request() is constructed when you call next() on it:

sf = SiteFetching()   # I assume this is how you instantiate SiteFetching
gen = sf.start_requests()   # Only returns a generator; no Request exists yet
req = next(gen)   # The first Request() is constructed here; its callback is not called yet

I showed only one call to next(), but you could call it in a loop or iterate over the generator with for. Either way, you decide when each Request() is constructed and what happens before and after it. In a real crawl, Scrapy itself iterates over whatever start_requests() returns.
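To see the lazy construction for yourself, a small experiment along these lines may help. The Request class is again a hypothetical stand-in for scrapy.Request, with a module-level `created` list (my own addition, purely for illustration) that records when each object is built.

```python
created = []  # URLs of the Request objects constructed so far


class Request:
    """Hypothetical stand-in for scrapy.Request that logs its own construction."""

    def __init__(self, url, callback=None):
        self.url = url
        self.callback = callback
        created.append(url)


urls = ['https://www.rev.com/freelancers/transcription',
        'https://www.rev.com/freelancers/captions',
        'https://www.rev.com/freelancers/subtitles']

gen = (Request(url) for url in urls)   # generator expression: nothing built yet
count_before = len(created)

first = next(gen)                      # builds exactly one Request
count_after_first = len(created)

remaining = [req for req in gen]       # iterating drains the rest
count_at_end = len(created)
```

With a list comprehension in place of the generator expression, all three objects would be constructed on the line that defines it.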
