
Multiple Requests to a Single Field in Scrapy

I am trying to scrape a website using Scrapy. Example link is: Here . I am able to get some data using CSS selectors, but I also need to fetch all image URLs of each item. An item can have multiple colours, and when you click on another colour, the browser actually fetches the images from a different URL. So I need to generate manual requests (one per colour) and use "meta" to accumulate the image URLs from those other URLs into a SINGLE ITEM FIELD.

Here is my Scrapy code:

def get_image_urls(self, response):
    item = response.meta['item']
    if 'image_urls' in item:
        urls = item['image_urls']
    else:
        urls = []
    urls.extend(response.css('.product-image-link::attr(href)').extract())
    item['image_urls'] = urls
    next_url = response.css('.va-color .emptyswatch a::attr(href)').extract()
    #print(item['image_urls'])
    yield Request(next_url[0], callback=self.get_image_urls, meta={'item': item})

def parse(self, response):
    output = JulesProduct()
    output['name'] = self.get_name(response)

    # Now get the recursive img urls
    response.meta['item'] = output
    self.get_image_urls(response)
    return output

Ideally, the returned output object should contain all of the required data. My question is: why am I not getting output['image_urls']? When I uncomment the print statement in get_image_urls, I see 3 crawled URLs and 3 print statements, with the URL list growing after each one. But I need the accumulated list back in the parse function. I'm not sure I've described my issue clearly; can anybody help?

Your parse method is returning the output before the get_image_urls requests are done.
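There is a second problem hiding in the original parse(): the line self.get_image_urls(response) does nothing at all. Because get_image_urls contains a yield, it is a generator function, and merely calling it creates a generator object whose body never runs unless something iterates it (Scrapy only iterates objects you yield or return from a callback). A minimal, Scrapy-free demonstration:

```python
def get_image_urls():
    # Stand-in for the spider callback: any function with `yield`
    # is a generator function.
    yield "Request(...)"

result = get_image_urls()        # body does NOT execute here
print(type(result).__name__)     # -> generator
print(list(result))              # only now does the body run
```

So even the Request yielded inside get_image_urls was never handed to the scheduler from parse().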

You should yield or return your final item only at the end of your recursive logic. Something like this should work:

def parse(self, response):
    output = JulesProduct()
    output['name'] = self.get_name(response)
    # dont_filter lets us re-request the URL parse() already visited;
    # note we pass `output`, the item built above, through meta
    yield Request(response.url, callback=self.get_image_urls, meta={'item': output}, dont_filter=True)

def get_image_urls(self, response):
    item = response.meta['item']
    if 'image_urls' in item:
        urls = item['image_urls']
    else:
        urls = []
    urls.extend(response.css('.product-image-link::attr(href)').extract())
    item['image_urls'] = urls
    next_url = response.css('.va-color .emptyswatch a::attr(href)').extract()

    if len(next_url) > 0:
        # another colour page remains: keep chaining, carrying the item along
        yield Request(next_url[0], callback=self.get_image_urls, meta={'item': item})
    else:
        # no more colour pages: the item is complete, hand it to the pipeline
        yield item
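The reason this works is that every chained request carries the same item through meta, and each callback appends its page's URLs before deciding whether to recurse or yield. Here is a plain-Python simulation of that accumulation pattern; the crawl is replaced by a hypothetical list of per-colour image batches (made-up paths, not real site data):

```python
# Hypothetical per-colour image batches standing in for the colour pages.
color_pages = [
    ["/img/red-1.jpg", "/img/red-2.jpg"],
    ["/img/blue-1.jpg"],
    ["/img/green-1.jpg", "/img/green-2.jpg"],
]

def crawl(pages):
    item = {"name": "demo product"}          # stands in for JulesProduct()
    for batch in pages:                      # each pass ~ one get_image_urls call
        urls = item.get("image_urls", [])    # same if/else logic as the spider
        urls.extend(batch)
        item["image_urls"] = urls
    return item                              # ~ `yield item` once no next_url remains

item = crawl(color_pages)
print(len(item["image_urls"]))  # -> 5
```

Only when the chain runs out of colour pages does the item leave the spider, which is exactly why yielding it early from parse() produced an item without image_urls.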
