
Download and save images from a website using Scrapy

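I'm new to Scrapy and Python, so my question may be a simple one. Following an existing guide, I wrote a scraper that crawls the pages of a website and writes each image's URL, name, and ... to an output file. I want to download the images to a directory, but the output directory is empty!

Here is my code: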

myspider.py

import scrapy


class BrickSetSpider(scrapy.Spider):
    name = 'brick_spider'
    start_urls = ['http://brickset.com/sets/year-2016']

    def parse(self, response):
        # Each set on the listing page is an element with class "set".
        SET_SELECTOR = '.set'
        for brickset in response.css(SET_SELECTOR):
            NAME_SELECTOR = 'h1 a ::text'
            PIECES_SELECTOR = './/dl[dt/text() = "Pieces"]/dd/a/text()'
            MINIFIGS_SELECTOR = './/dl[dt/text() = "Minifigs"]/dd[2]/a/text()'
            IMAGE_SELECTOR = 'img ::attr(src)'
            yield {
                'name': brickset.css(NAME_SELECTOR).extract_first(),
                'pieces': brickset.xpath(PIECES_SELECTOR).extract_first(),
                'minifigs': brickset.xpath(MINIFIGS_SELECTOR).extract_first(),
                'image': brickset.css(IMAGE_SELECTOR).extract_first(),
            }

        # Follow the "next" link, if there is one, and parse that page too.
        NEXT_PAGE_SELECTOR = '.next a ::attr(href)'
        next_page = response.css(NEXT_PAGE_SELECTOR).extract_first()
        if next_page:
            yield scrapy.Request(
                response.urljoin(next_page),
                callback=self.parse
            )

settings.py

ITEM_PIPELINES = {'brickset.pipelines.BricksetPipeline': 1}
IMAGES_STORE = '/home/nmd/brickset/brickset/spiders/output'


items.py

import scrapy


class BrickSetSpider(scrapy.Item):
    image_urls = scrapy.Field()
    images = scrapy.Field()

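If you want to download files or images, Scrapy provides media pipelines for that. Enable the built-in images pipeline in settings.py: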

ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}
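For orientation, here is a minimal sketch of how the relevant part of settings.py might look once the built-in pipeline is enabled. The IMAGES_STORE path is the one from the question; the assumption is that the custom BricksetPipeline entry is no longer needed. Note that ImagesPipeline relies on the Pillow library being installed.

```python
# settings.py (sketch, assuming only the built-in images pipeline is wanted)

# Enable Scrapy's built-in images pipeline. The number sets its order
# relative to any other pipelines that are enabled.
ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1,
}

# Directory where downloaded images are saved (path taken from the question).
IMAGES_STORE = '/home/nmd/brickset/brickset/spiders/output'
```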

Then add an image_urls field to the item so the pipeline knows what to download. The pipeline expects image_urls to be a list of URLs, so wrap the extracted value in a list. Change

    yield {
        'name': brickset.css(NAME_SELECTOR).extract_first(),
        'pieces': brickset.xpath(PIECES_SELECTOR).extract_first(),
        'minifigs': brickset.xpath(MINIFIGS_SELECTOR).extract_first(),
        'image': brickset.css(IMAGE_SELECTOR).extract_first(),
    }

to

    yield {
        'name': brickset.css(NAME_SELECTOR).extract_first(),
        'pieces': brickset.xpath(PIECES_SELECTOR).extract_first(),
        'minifigs': brickset.xpath(MINIFIGS_SELECTOR).extract_first(),
        'image_urls': [brickset.css(IMAGE_SELECTOR).extract_first()],
    }
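Two details are easy to trip over when filling image_urls: the extracted src may be missing (extract_first() then returns None), and it may be a relative URL, which the pipeline cannot fetch. Below is a minimal sketch of one way to handle both. The helper name to_image_urls and the example image path are hypothetical and not part of Scrapy.

```python
from urllib.parse import urljoin


def to_image_urls(page_url, src):
    """Build a value for an item's image_urls field.

    Hypothetical helper, not part of Scrapy. Returns a list holding one
    absolute URL, or an empty list when no src was found.
    """
    if not src:
        # Nothing was extracted: give the pipeline nothing to download.
        return []
    # Resolve a relative src against the page URL; absolute URLs pass through.
    return [urljoin(page_url, src)]


# Example with a made-up relative image path.
urls = to_image_urls('http://brickset.com/sets/year-2016', '/images/example.jpg')
print(urls)
```

Inside the spider this could be used as 'image_urls': to_image_urls(response.url, brickset.css(IMAGE_SELECTOR).extract_first()).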

For more details, see https://doc.scrapy.org/en/latest/topics/media-pipeline.html
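When checking whether the output directory is still empty, it helps to know where to look. As that documentation describes, the images pipeline by default saves each file under a full/ subdirectory of IMAGES_STORE, named after the SHA-1 hash of the image URL. The sketch below computes that expected location; the helper name expected_image_path and the example URL are hypothetical, and a customized pipeline may use a different layout.

```python
import hashlib
import posixpath


def expected_image_path(images_store, image_url):
    """Return the path where the default images pipeline is expected to
    store the image downloaded from image_url.

    Hypothetical helper for inspection only, not part of Scrapy.
    """
    # Default naming: SHA-1 hex digest of the URL, with a .jpg extension.
    image_id = hashlib.sha1(image_url.encode('utf-8')).hexdigest()
    return posixpath.join(images_store, 'full', image_id + '.jpg')


# Made-up image URL, combined with the IMAGES_STORE path from the question.
print(expected_image_path(
    '/home/nmd/brickset/brickset/spiders/output',
    'http://example.com/images/example.jpg',
))
```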
