
python scrapy: spider follows links but won't download images

I've built a basic CrawlSpider to scrape the comic images from xkcd, follow the links to each comic, and continue scraping. The spider follows links just fine, but I'm having trouble actually scraping the images.

I've tried multiple XPath and CSS selectors and several ways of writing the parse_item method, but I either get errors because Scrapy tries to use the first letter of the URL as the full URL, or I get unhashable type: 'list' errors. I have run out of ideas.

Spider:

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class XkcdSpider(CrawlSpider):
    name = 'xkcd'
    allowed_domains = ['xkcd.com']
    start_urls = ['http://xkcd.com/']

    rules = (
    Rule(LinkExtractor(allow=r'\/\d{4}\/', unique=True),
         callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        i = {}
        relative_url = response.xpath(
            '//*[@id="comic"]/img/@src').extract_first()

        absolute_url = response.urljoin(relative_url)
        i['image_urls'] = absolute_url
        return i

Items:

import scrapy


class XkcdItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    images = scrapy.Field()
    image_urls = scrapy.Field()

The image pipeline is set up like this:

ITEM_PIPELINES = {
   'scrapy.pipelines.images.ImagesPipeline': 1,
}

The traceback is either this:

TypeError: unhashable type: 'list'

Or this:

ValueError: Missing scheme in request url: h

I understand the second one comes from Scrapy trying to use the first letter of the URL rather than the whole thing, but I can't find a way to make it work. I have tried plain .extract() instead of .extract_first(), but that doesn't work either.

Any help greatly appreciated.

Try it like this:

srcs = response.xpath('//*[@id="comic"]/img/@src').extract()
i['image_urls'] = [response.urljoin(src) for src in srcs]
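
The list matters because the images pipeline loops over whatever is stored in image_urls. A bare string is therefore walked one character at a time, which would explain the "Missing scheme in request url: h" error. A minimal plain-Python sketch of that behaviour, where urls_requested is a hypothetical stand-in for the pipeline's loop and not a Scrapy function:

```python
def urls_requested(image_urls):
    # Hypothetical stand-in: collect each element the way a loop over
    # the image_urls field would see it.
    return [url for url in image_urls]


as_string = "http://imgs.xkcd.com/comics/state_borders.png"

# A bare string is iterated character by character.
print(urls_requested(as_string)[0])    # prints: h

# A one-element list yields the whole URL.
print(urls_requested([as_string])[0])
```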

You have probably done this already but, just in case, make sure the IMAGES_STORE setting is set correctly.
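
For reference, a settings.py sketch with both settings side by side. The directory name 'images' is only an assumption for illustration; any writable path should do. ImagesPipeline also needs the Pillow library installed.

```python
# settings.py (sketch)

# Enable the built-in images pipeline, as in the question.
ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1,
}

# Directory where downloaded images are stored.
# 'images' is an assumed example path, not taken from the question.
IMAGES_STORE = 'images'
```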

When asked about the output of absolute_url, the OP replied:

[root] INFO: imgs.xkcd.com/comics/state_borders.png Does this look right?

This is incorrect, and that is what the scraper is telling you: "Missing scheme in request url" means your URL is missing the HTTP scheme information.

Also assign a list to the image_urls field.

i['image_urls'] = ["https://" + absolute_url] #adding scheme to URL
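
Prepending "https://" unconditionally would double the scheme on a URL that already has one. A slightly more defensive sketch, using only the standard library, is shown here; ensure_scheme is a hypothetical helper name and the "https" default is an assumption:

```python
from urllib.parse import urlparse


def ensure_scheme(url, default_scheme="https"):
    """Return url unchanged if it has a scheme, else prepend default_scheme."""
    # Protocol-relative URLs such as //host/path only lack the scheme part.
    if url.startswith("//"):
        return f"{default_scheme}:{url}"
    # URLs that already carry a scheme are left alone.
    if urlparse(url).scheme:
        return url
    # Bare host/path strings get the full scheme prefix.
    return f"{default_scheme}://{url}"


print(ensure_scheme("imgs.xkcd.com/comics/state_borders.png"))
```

In parse_item this could be used as i['image_urls'] = [ensure_scheme(absolute_url)], which keeps the value a list.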
