Web Scraping using Scrapy

I am scraping the Flipkart website and I want to extract the image URLs from it. The page I am scraping is the one listed in start_urls below.

import scrapy
from ..items import FlipcartItem
class QuotesSpider(scrapy.Spider):
    name='quotes'
    start_urls=[
        'https://www.flipkart.com/clothing-and-accessories/topwear/pr?sid=clo%2Cash&otracker=categorytree&p%5B%5D=facets.ideal_for%255B%255D%3DMen'
        ]
    def parse(self,response):
        items=FlipcartItem()
        image_url=response.css('._2r_T1I img::attr(src)').extract()
        #product_page_url=response.css('').extract()
        items['image_url']=image_url
        #items['product_page']=title
        yield items

This is the code I have written. When I run it, I get a list of empty strings, like image_url ["","",""]. Can anyone please suggest where I am going wrong?

This site uses JavaScript to load its images, so the HTML that Scrapy downloads does not contain them. You need to use Selenium to render the page and extract the image data. Here I use the Scrapy Selector to parse the page source that Selenium returns. If you want to use Scrapy together with Selenium you can follow this url, or use Scrapy Splash.

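from selenium import webdriver
from selenium.webdriver.firefox.service import Service
from scrapy.selector import Selector

# Selenium 4 takes the driver path through a Service object;
# on Selenium 3, use webdriver.Firefox(executable_path='./geckodriver') instead.
browser = webdriver.Firefox(service=Service(executable_path='./geckodriver'))
browser.get(url="https://www.flipkart.com/clothing-and-accessories/topwear/pr?sid=clo%2Cash&otracker=categorytree&p%5B%5D=facets.ideal_for%255B%255D%3DMen")

# page_source is the DOM after the browser has run the page's JavaScript.
page = browser.page_source
browser.quit()

selector = Selector(text=page)
image_data = selector.css('img._2r_T1I::attr(src)').extract()
# Alternative: selector.xpath('//div[@class="CXW8mj _21_khk"]/img/@src').get()

print(image_data)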

If you need to install Selenium, please follow this url.
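
To check whether a page really leaves its image sources empty before JavaScript runs, you can count the empty src attributes in the raw HTML. The following is a minimal sketch using only the standard library; the names ImgSrcCollector, count_empty_sources and SAMPLE_HTML are made up for illustration, and the sample markup is an assumption standing in for the real response body.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag in a document."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            # A missing or valueless src is recorded as an empty string.
            self.sources.append(dict(attrs).get("src") or "")


def count_empty_sources(html):
    """Return (empty, total) counts of <img> src values found in html."""
    collector = ImgSrcCollector()
    collector.feed(html)
    empty = sum(1 for src in collector.sources if not src.strip())
    return empty, len(collector.sources)


# Hypothetical markup standing in for the raw (unrendered) response body.
SAMPLE_HTML = """
<div><img class="_2r_T1I" src=""></div>
<div><img class="_2r_T1I" src=""></div>
<div><img class="logo" src="/static/logo.png"></div>
"""

print(count_empty_sources(SAMPLE_HTML))
```

Inside a spider you could call count_empty_sources(response.text); if most product images come back empty, the page most likely needs a JavaScript-capable renderer.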

You should consider changing this line:

image_url=response.css('._2r_T1I img::attr(src)').extract()

To this:

image_urls=response.css('img._2r_T1I').xpath('@src').getall()

Also, be aware that "image_url" will be a list even if there is only one match, because that is what Scrapy returns. You may want to iterate over the results and create a new FlipcartItem for each one, or, if you only expect one result, pull it out of the list.
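
Both options can be sketched roughly as follows. This is only an illustration: items_from_urls and first_or_none are hypothetical helper names, and plain dicts are used instead of FlipcartItem so the snippet runs without a Scrapy project (Scrapy accepts dicts as items).

```python
def items_from_urls(image_urls):
    """Yield one item per non-empty image URL.

    Plain dicts are used here; a FlipcartItem could be filled in the
    same way inside a real project.
    """
    for url in image_urls:
        if url:  # skip empty strings such as those in ["", "", ""]
            yield {"image_url": url}


def first_or_none(image_urls):
    """Return the first non-empty URL, or None when there is none."""
    return next((url for url in image_urls if url), None)


# Hypothetical extraction result mixing empty and real values.
raw = ["", "https://example.com/a.jpg", "https://example.com/b.jpg"]
for item in items_from_urls(raw):
    print(item)
print(first_or_none(raw))
```

In parse() this would become something like "yield from items_from_urls(image_urls)", so each image ends up in its own item.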

I tried doing this:

import scrapy
class QuotesSpider(scrapy.Spider):
    name='quotes'
    start_urls=[
        'https://www.flipkart.com/clothing-and-accessories/topwear/pr?sid=clo%2Cash&otracker=categorytree&p%5B%5D=facets.ideal_for%255B%255D%3DMen'
        ]
    def parse(self,response):
        raw_image_urls=response.css('img._2r_T1I').xpath('@src').getall()
        clean_image_urls=[]
        for img_url in raw_image_urls:
            clean_image_urls.append(response.urljoin(img_url))
        yield{
        'image_urls':clean_image_urls
        }

But now I am getting the URL of the main page, not the image URLs.

This is a site with JavaScript-generated content. Use "View page source" and you can see that the image src is empty. Nothing is wrong with the code. Just use Selenium or Scrapy Splash; they run the page's JavaScript for you so that you can scrape the data.
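
The empty src also explains why the second attempt printed the main page URL. As far as I understand, response.urljoin resolves its argument against the page URL in the same way as urllib.parse.urljoin, and joining an empty string simply returns the base. A small sketch with the standard library; PAGE_URL and resolve_all are hypothetical names, and the URL is a placeholder rather than the real listing page.

```python
from urllib.parse import urljoin

# Placeholder for the page that was scraped (an assumption for this demo).
PAGE_URL = "https://www.example.com/shop/list?page=1"


def resolve_all(base_url, srcs):
    """Resolve each src value against base_url, as urljoin would."""
    return [urljoin(base_url, src) for src in srcs]


# Empty src values collapse to the page URL; real paths resolve normally.
resolved = resolve_all(PAGE_URL, ["", "", "/images/shirt.jpg"])
for url in resolved:
    print(url)
```

So a list full of the page's own URL is a sign that every src was empty when the HTML was downloaded, not that urljoin is misbehaving.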
