
Web Scraping using Scrapy

I am scraping the Flipkart website and want to extract the image URLs from a product listing page. The page URL is the one in start_urls below.

import scrapy
from ..items import FlipcartItem


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = [
        'https://www.flipkart.com/clothing-and-accessories/topwear/pr?sid=clo%2Cash&otracker=categorytree&p%5B%5D=facets.ideal_for%255B%255D%3DMen'
    ]

    def parse(self, response):
        items = FlipcartItem()
        image_url = response.css('._2r_T1I img::attr(src)').extract()
        #product_page_url=response.css('').extract()
        items['image_url'] = image_url
        #items['product_page']=title
        yield items

This is the code I have written. When I run it, I get a list of empty strings, like image_url: ["", "", ""]. Can anyone suggest where I am going wrong?

This site uses JavaScript to load its images, so the HTML that Scrapy downloads does not contain them. You need a real browser, such as one driven by Selenium, to get the image data. Here I use Selenium to render the page and Scrapy's Selector to extract the image URLs from the rendered source. You could also combine Scrapy with Selenium, or use Scrapy Splash.

from selenium import webdriver
from scrapy.selector import Selector

# executable_path is the Selenium 3 way to point at geckodriver;
# Selenium 4 expects a Service object instead.
browser = webdriver.Firefox(executable_path='./geckodriver')
browser.get(url="https://www.flipkart.com/clothing-and-accessories/topwear/pr?sid=clo%2Cash&otracker=categorytree&p%5B%5D=facets.ideal_for%255B%255D%3DMen")

# Grab the HTML after the browser has run the page's JavaScript, then close the browser
page = browser.page_source
browser.quit()

selector = Selector(text=page)
image_data = selector.css('img._2r_T1I::attr(src)').extract()
# print(selector.xpath('//div[@class="CXW8mj _21_khk"]/img/@src').get())

print(image_data)

If you need to install Selenium, see the Selenium installation documentation.

You should consider changing this line:

image_url=response.css('._2r_T1I img::attr(src)').extract()

to this:

image_urls=response.css('img._2r_T1I').xpath('@src').getall()

Also be aware that image_url will be a list even if there is only one match, because that is what Scrapy's selectors return. You may want to iterate over the results and create a new FlipcartItem for each one, or, if you expect only one result, pull it out of the list.
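A minimal sketch of the "one item per image" idea, written with the standard library only so it can be run without Scrapy. The helper name build_image_items and the sample URLs are hypothetical, not taken from the question; plain dicts stand in for FlipcartItem.

```python
from urllib.parse import urljoin


def build_image_items(raw_srcs, page_url):
    """Return one dict per non-empty image src, resolved against page_url."""
    items = []
    for src in raw_srcs:
        src = src.strip()
        if not src:
            # Skip empty src values, e.g. placeholders not yet filled by JavaScript
            continue
        items.append({"image_url": urljoin(page_url, src)})
    return items


# Hypothetical sample values, not scraped from the real page
sample = build_image_items(
    ["", "/img/shirt-1.jpg", "https://cdn.example.com/img/shirt-2.jpg"],
    "https://www.example.com/clothing/",
)
print(sample)
```

Inside a spider's parse method, this could be used roughly as: for item in build_image_items(response.css('img._2r_T1I::attr(src)').getall(), response.url): yield item.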

I tried doing this:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = [
        'https://www.flipkart.com/clothing-and-accessories/topwear/pr?sid=clo%2Cash&otracker=categorytree&p%5B%5D=facets.ideal_for%255B%255D%3DMen'
    ]

    def parse(self, response):
        raw_image_urls = response.css('img._2r_T1I').xpath('@src').getall()
        clean_image_urls = []
        for img_url in raw_image_urls:
            clean_image_urls.append(response.urljoin(img_url))
        yield {
            'image_urls': clean_image_urls
        }

But I am getting the URL of the main page, not the image URLs.

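This site generates its content with JavaScript. Use "View page source" and you will see that the image src attributes are empty, which is also why response.urljoin hands back the main page URL: joining an empty string onto the page URL returns the page URL unchanged. Nothing is wrong with the code. Use Selenium or Scrapy Splash; they run the page's JavaScript for you, so you can scrape the data.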
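The effect can be reproduced without Scrapy. The sketch below parses a small, made-up piece of static HTML whose img tags have empty src attributes, then resolves each src against the page URL with urllib.parse.urljoin, which, as far as I know, is what response.urljoin relies on. PAGE_URL, STATIC_HTML, and the ImgSrcCollector class are assumptions for illustration only.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag (empty string if missing)."""

    def __init__(self):
        super().__init__()
        self.srcs = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            self.srcs.append(dict(attrs).get("src") or "")


# Hypothetical stand-ins for the page URL and its un-rendered HTML
PAGE_URL = "https://www.example.com/clothing/"
STATIC_HTML = '<div><img class="_2r_T1I" src=""><img class="_2r_T1I" src=""></div>'

collector = ImgSrcCollector()
collector.feed(STATIC_HTML)
raw_srcs = collector.srcs

# Joining an empty src onto the page URL gives back the page URL itself
joined = [urljoin(PAGE_URL, src) for src in raw_srcs]

print(raw_srcs)
print(joined)
```

If the same check on the real "View page source" output shows empty src values, the images are filled in later by JavaScript and a rendering tool is needed.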
