How to make a loop in return so as not to repeat scrapy.Request?

I am scraping a page. I tried to make a loop in the return statement, but it didn't work; it gave me the result of just the first link. I want to make a loop so that I can return all three values.

import scrapy

class SiteFetching(scrapy.Spider):
    name = 'Site'

    def start_requests(self):
        links = {'transcription_page': 'https://www.rev.com/freelancers/transcription',
                 'captions_page': 'https://www.rev.com/freelancers/captions',
                 'subtitles_page': 'https://www.rev.com/freelancers/subtitles'}
        call = [self.parse_transcription, self.parse_caption, self.parse_subtitles]

        return [
            scrapy.Request(links['transcription_page'], callback=call[0]),
            scrapy.Request(links['captions_page'], callback=call[1]),
            scrapy.Request(links['subtitles_page'], callback=call[2])
        ]

Yes, you can have a list comprehension do the looping, so that scrapy.Request() appears only once in the program's text; of course, being a loop, the call is still made once per iteration:

class SiteFetching(scrapy.Spider):
    name = 'Site'

    def start_requests(self):
        links = [('https://www.rev.com/freelancers/transcription', self.parse_transcription),
                 ('https://www.rev.com/freelancers/captions', self.parse_caption),
                 ('https://www.rev.com/freelancers/subtitles', self.parse_subtitles)]

        return [scrapy.Request(link[0], callback=link[1]) for link in links]

Another option, if you want to avoid making all the requests at once and waiting for them all to return, is to use a generator expression:

        return (scrapy.Request(link[0], callback=link[1]) for link in links)

Btw, I know nothing about Spider etc.
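For reference, Scrapy's documentation usually writes start_requests as a generator function that yields each request, which behaves the same as returning the generator expression above. A minimal sketch of that form:

import scrapy

class SiteFetching(scrapy.Spider):
    name = 'Site'

    def start_requests(self):
        links = [('https://www.rev.com/freelancers/transcription', self.parse_transcription),
                 ('https://www.rev.com/freelancers/captions', self.parse_caption),
                 ('https://www.rev.com/freelancers/subtitles', self.parse_subtitles)]

        # Yielding turns start_requests itself into a generator,
        # so each Request() is only created when the crawler asks for it
        for url, callback in links:
            yield scrapy.Request(url, callback=callback)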

Now when you call start_requests() it returns a generator, and you call next() on it to create each Request():

sf = SiteFetching()   # I assume this is how you instantiate SiteFetching
gen = sf.start_requests()   # Only returns a generator
req = next(gen)   # Only here does the first call to Request() occur with callback to follow.

I only showed one call to next(), but you could use a loop (or iterate over it with for, as sketched below); any way you do it, you get to decide when each Request() is created and what you do before and after each call.
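For example, a minimal sketch of driving the generator with a for loop instead of repeated next() calls, assuming as above that SiteFetching can be instantiated directly (the print is just a placeholder for whatever per-request work you need):

sf = SiteFetching()
for req in sf.start_requests():   # each iteration advances the generator, creating one Request()
    print('created a request for', req.url)   # do your before/after work here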

