
Cannot download images from website with scrapy

I'm starting with Scrapy in order to automate file downloading from websites. As a test, I want to download the jpg files from this website. My code is based on the intro tutorial and the Files and Images Pipeline tutorial on the Scrapy website.

My code is this:

In settings.py, I have added these lines:

ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}

IMAGES_STORE = '/home/lucho/Scrapy/jpg/'

My items.py file is:

import scrapy

class JpgItem(scrapy.Item):
    image_urls = scrapy.Field()
    images = scrapy.Field()
    pass

My pipeline file is:

import scrapy
from scrapy.pipelines.images import ImagesPipeline
from scrapy.exceptions import DropItem

class JpgPipeline(object):
    def process_item(self, item, spider):
        return item
    def get_media_requests(self, item, info):
        for image_url in item['image_urls']:
            yield scrapy.Request(image_url)

    def item_completed(self, results, item, info):
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        item['image_paths'] = image_paths
        return item

Finally, my spider file is:

import scrapy
from .. items import JpgItem

class JpgSpider(scrapy.Spider):
    name = "jpg"
    allowed_domains = ["http://www.kevinsmedia.com"]
    start_urls = [
        "http://www.kevinsmedia.com/km/mp3z/Fluke/Risotto/"
    ]

def init_request(self):
    #"""This function is called before crawling starts."""
    return Request(url=self.login_page, callback=self.parse)

def parse(self, response):
    item = JpgItem()
    return item

(Ideally, I want to download all the jpg files without specifying the exact web address of each one.)

The output of "scrapy crawl jpg" is:

2015-12-08 19:19:30 [scrapy] INFO: Scrapy 1.0.3.post6+g2d688cd started (bot: jpg)
2015-12-08 19:19:30 [scrapy] INFO: Optional features available: ssl, http11
2015-12-08 19:19:30 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'jpg.spiders', 'SPIDER_MODULES': ['jpg.spiders'], 'COOKIES_ENABLED': False, 'DOWNLOAD_DELAY': 3, 'BOT_NAME': 'jpg'}
2015-12-08 19:19:30 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2015-12-08 19:19:30 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-12-08 19:19:30 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-12-08 19:19:30 [scrapy] INFO: Enabled item pipelines: ImagesPipeline
2015-12-08 19:19:30 [scrapy] INFO: Spider opened
2015-12-08 19:19:30 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-12-08 19:19:30 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-12-08 19:19:31 [scrapy] DEBUG: Crawled (200) <GET http://www.kevinsmedia.com/km/mp3z/Fluke/Risotto/> (referer: None)
2015-12-08 19:19:31 [scrapy] DEBUG: Scraped from <200 http://www.kevinsmedia.com/km/mp3z/Fluke/Risotto/>
{'images': []}
2015-12-08 19:19:31 [scrapy] INFO: Closing spider (finished)
2015-12-08 19:19:31 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 254,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 2975,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2015, 12, 8, 22, 19, 31, 294139),
 'item_scraped_count': 1,
 'log_count/DEBUG': 3,
 'log_count/INFO': 7,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2015, 12, 8, 22, 19, 30, 619918)}
2015-12-08 19:19:31 [scrapy] INFO: Spider closed (finished)

While there seems to be no error, the program is not retrieving the jpg files. In case it matters, I'm using Ubuntu.

You haven't defined parse() in your JpgSpider class.
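For reference, a minimal sketch of what the spider could look like with parse() indented so it actually belongs to the class (the init_request method is dropped here, since nothing in the question defines login_page, and allowed_domains is given as a bare domain name, which is what Scrapy expects):

import scrapy
from ..items import JpgItem

class JpgSpider(scrapy.Spider):
    name = "jpg"
    allowed_domains = ["kevinsmedia.com"]
    start_urls = [
        "http://www.kevinsmedia.com/km/mp3z/Fluke/Risotto/"
    ]

    def parse(self, response):
        # parse() must be a method of the class, or Scrapy never calls it
        item = JpgItem()
        return item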

Update. This doesn't look like a problem you should be attacking with scrapy, now that I can see the URL in your update. wget might be more appropriate; have a look at the answers here. In particular, look at the first comment on the top answer to see how to use a file extension to limit which files you download (-A jpg).

Update 2: The parse() routine can get the album art URLs from the <a> tags using this code:

part_urls = response.xpath('//a[contains(., "AlbumArt")]/@href').extract()

This returns a list of partial URLs; you will need to add the root URL of the page you are parsing, which you can get from response.url. There are a few % codes in the URLs I've looked at; they may be a problem, but try it anyway. Once you have a list of the full URLs, put them into item[]:

item['image_urls'] = full_urls
yield item
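Putting this together, here is a hedged sketch of what parse() might look like. The XPath is taken from the snippet above, and response.urljoin is used to resolve the partial (possibly %-encoded) hrefs against response.url; I haven't tested this against the site:

def parse(self, response):
    # This goes inside JpgSpider, indented as a class method.
    # Hrefs of the album-art links; .extract() gives plain strings.
    part_urls = response.xpath('//a[contains(., "AlbumArt")]/@href').extract()
    # Resolve each partial href against the page URL.
    full_urls = [response.urljoin(u) for u in part_urls]
    item = JpgItem()
    item['image_urls'] = full_urls
    yield item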

This should get Scrapy to download the images automatically, so you can remove your custom pipeline code and let the built-in ImagesPipeline do the heavy lifting.
