
Python scrapy start_urls

Is it possible to do something like the code below, but with multiple URLs? Each link has about 50 pages to crawl and loop over. My current solution works, but only when I use a single URL instead of multiple URLs.

 start_urls = [

'https://www.xxxxxxx.com.au/home-garden/page-%s/c18397' % page for page in range(1, 50),
'https://www.xxxxxxx.com.au/automotive/page-%s/c21159' % page for page in range(1, 50),
'https://www.xxxxxxx.com.au/garden/page-%s/c25449' % page for page in range(1, 50),
 ]

You can do this by expanding the URL templates into a second list. The code is below; I hope this is what you're looking for.

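final_urls = []
# Note: if this list is the spider's start_urls attribute, Scrapy will also
# request these unformatted template URLs as they are.
start_urls = [
    'https://www.xxxxxxx.com.au/home-garden/page-%s/c18397',
    'https://www.xxxxxxx.com.au/automotive/page-%s/c21159',
    'https://www.xxxxxxx.com.au/garden/page-%s/c25449',
]
# Fill in the page number for every template: pages 1 to 49, since range() excludes 50.
final_urls.extend(url % page for page in range(1, 50) for url in start_urls)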
Output Snippet
def parse(self, response):

    for link in final_urls:
        request = scrapy.Request(link)
        yield request

About your latest enquiry, have you tried this?

def parse(self, response):
    for link in final_urls:
        request = scrapy.Request(link)
        yield request
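As a side note on the approach above: the expanded list can also be handed to Scrapy as start_urls directly, so the default start requests cover every page and no extra loop in parse is needed. This is only a minimal sketch, assuming the code sits at module level. URL_TEMPLATES, LAST_PAGE and expanded_urls are illustrative names that do not appear in the original posts, and LAST_PAGE = 50 assumes page 50 should be included (range() excludes its end value, so range(1, 50) stops at page 49).

```python
# Hypothetical names for illustration only: URL_TEMPLATES, LAST_PAGE, expanded_urls.
URL_TEMPLATES = [
    'https://www.xxxxxxx.com.au/home-garden/page-%s/c18397',
    'https://www.xxxxxxx.com.au/automotive/page-%s/c21159',
    'https://www.xxxxxxx.com.au/garden/page-%s/c25449',
]
LAST_PAGE = 50  # assumption: page 50 is the last page to crawl

# One URL per (template, page) pair, grouped by category.
# range(1, LAST_PAGE + 1) is used because range() excludes its end value.
expanded_urls = [
    template % page
    for template in URL_TEMPLATES
    for page in range(1, LAST_PAGE + 1)
]
```

In a spider class, start_urls = expanded_urls would then be enough for Scrapy to request all of these pages.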

I recommend using start_requests for this:

def start_requests(self):
    base_urls = [
        'https://www.xxxxxxx.com.au/home-garden/page-{page_number}/c18397',
        'https://www.xxxxxxx.com.au/automotive/page-{page_number}/c21159',
        'https://www.xxxxxxx.com.au/garden/page-{page_number}/c25449',
    ]

    # range(1, 50) yields pages 1 through 49
    for page in range(1, 50):
        for base_url in base_urls:
            url = base_url.format(page_number=page)
            yield scrapy.Request(url, callback=self.parse)
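If it helps, the URL generation in start_requests can be pulled out into a plain generator, which makes it easy to check the URLs without running a crawl. This is a rough sketch rather than part of the answer: iter_page_urls, first_page and last_page are hypothetical names, and the page range is treated as inclusive here, which is an assumption.

```python
from typing import Iterable, Iterator


def iter_page_urls(base_urls: Iterable[str], first_page: int, last_page: int) -> Iterator[str]:
    """Yield one URL per (page, base URL) pair for pages first_page..last_page inclusive.

    Hypothetical helper: each base URL must contain a {page_number} placeholder.
    """
    base_urls = list(base_urls)  # materialize so it can be re-iterated for every page
    for page in range(first_page, last_page + 1):
        for base_url in base_urls:
            yield base_url.format(page_number=page)


base_urls = [
    'https://www.xxxxxxx.com.au/home-garden/page-{page_number}/c18397',
    'https://www.xxxxxxx.com.au/automotive/page-{page_number}/c21159',
    'https://www.xxxxxxx.com.au/garden/page-{page_number}/c25449',
]

# Small range just to inspect the order: all categories for page 1, then page 2, ...
urls = list(iter_page_urls(base_urls, 1, 3))
```

Inside start_requests, the loop would then shrink to iterating over iter_page_urls(base_urls, 1, 50) and yielding scrapy.Request(url, callback=self.parse) for each URL.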

