
Scrapy - how to parse multiple start_urls

I am working on my first Scrapy project, starting with a fairly simple website, StockX.

I would like to scrape the different categories of items. If I use the URLs below as my start_urls, how do I parse each start URL?

https://stockx.com/sneakers
https://stockx.com/streetwear
https://stockx.com/collectibles
https://stockx.com/handbags
https://stockx.com/watches

The product page is typically structured as the following:

https://stockx.com/air-max-90-patta-homegrown-grass

I am trying to read through the documentation on this topic but couldn't quite follow it.

I know the code below isn't right because I'm forcing a list of result URLs; I'm just not sure how the multiple start_urls should be processed in the first parse.

    def parse(self, response):
        # obtain number of pages per product category
        text = list(map(lambda x: x.split('='),
                        response.xpath('//a[@class="PagingButton__PaginationButton-sc-1o2t560-0 eZnUxt"]/@href').extract()))
        total_pages = int(text[-1][-1])

        # compile a list of URLs for each result page
        cat = ['sneakers', 'streetwear', 'collectibles', 'handbags', 'watches']
        cat = ['https://stockx.com/{}'.format(x) for x in cat]

        result_urls = []
        for x in cat:
            for y in range(1, total_pages + 1):
                result_urls.append(x + '?page={}'.format(y))

        for url in result_urls[7:9]:
            # print('Lets try: ', url)
            yield Request(url=url, callback=self.parse_results)
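The URL-building step above can be checked outside Scrapy in plain Python. In this sketch `total_pages` is hard-coded as an assumption; in the spider it is parsed from the pagination links:

```python
# Standalone sketch of the category/page URL construction above.
# total_pages is hard-coded here for illustration; the real spider
# extracts it from the pagination buttons on the page.
total_pages = 3

categories = ['sneakers', 'streetwear', 'collectibles', 'handbags', 'watches']
category_urls = ['https://stockx.com/{}'.format(c) for c in categories]

result_urls = []
for base in category_urls:
    for page in range(1, total_pages + 1):
        result_urls.append('{}?page={}'.format(base, page))

print(result_urls[0])    # https://stockx.com/sneakers?page=1
print(len(result_urls))  # 5 categories * 3 pages = 15
```

Note that `list.append` returns `None`, which is why the original `result_urls = lst.append(...)` assignment fails: the list must be built first and iterated afterwards.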

Try something like this -

class ctSpider(Spider):
    name = "stack"

    def start_requests(self):
        for d in [URLS]:
            yield Request(d, callback=self.parse)
...

A simple solution is to use start_urls: https://doc.scrapy.org/en/1.4/topics/spiders.html#scrapy.spiders.Spider.start_urls

class MLBoddsSpider(BaseSpider):
    name = "stockx.com"
    allowed_domains = ["stockx.com"]
    start_urls = [
        "https://stockx.com/watches",
        "https://stockx.com/collectibles",
    ]

    def parse(self, response):
        ................
        ........

You can even take control of start_requests yourself.
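For reference, Scrapy's default start_requests simply yields one request per entry in start_urls, all routed to self.parse. A rough stand-in of that behaviour (using a dummy Request class for illustration, not scrapy.Request) looks like this:

```python
# Dummy Request class standing in for scrapy.Request, so the sketch
# runs without Scrapy installed.
class Request:
    def __init__(self, url, callback=None):
        self.url = url
        self.callback = callback

class SpiderSketch:
    start_urls = [
        "https://stockx.com/watches",
        "https://stockx.com/collectibles",
    ]

    # Mimics Scrapy's default start_requests: one Request per
    # start_urls entry, each with self.parse as the callback.
    def start_requests(self):
        for url in self.start_urls:
            yield Request(url, callback=self.parse)

    def parse(self, response):
        pass

spider = SpiderSketch()
requests = list(spider.start_requests())
print([r.url for r in requests])
```

Overriding start_requests in a real spider lets you replace this loop with any request-generation logic you like (pagination, POSTs, per-category callbacks, and so on).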

You can use a list comprehension in place of the initial start_urls list. For instance...

class myScraper(scrapy.Spider):
    name = 'movies'
    abc = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
    url = "amazon.com/movies/{}"
    start_urls = [url.format(x) for x in abc]

Note: Please don't run this; it is just for inspiration. I did something like this in a project a while back (and was too lazy to look it up again) and it worked. It saves you the time of writing a custom start_requests function.

The URL I used does not exist; it's just an example of something you can do.

The main idea is to use a list comprehension in place of the default start_urls list so that you don't have to write a fancy function.
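The list comprehension itself can be checked in plain Python; the amazon.com pattern is the answer's made-up example and does not correspond to a real page:

```python
import string

# Build start_urls from a template with one placeholder per letter.
# The URL pattern is a made-up example, not a real endpoint.
url = "amazon.com/movies/{}"
start_urls = [url.format(x) for x in string.ascii_lowercase]

print(start_urls[0])    # amazon.com/movies/a
print(len(start_urls))  # 26
```

The original snippet wrote the template as `"amazon.com/movies/{x}"` and then called `url.format(x)`; a named placeholder with a positional argument raises a `KeyError`, so either use `{}` with `format(x)` as above, or `{x}` with `format(x=x)`.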
