
Python + Scrapy: Issues running "ImagesPipeline" when running crawler from script

I'm brand new to Python, so I apologize if there's a dumb mistake here... I've been scouring the web for days, looking at similar issues and combing through the Scrapy docs, and nothing seems to resolve this for me.

I have a Scrapy project which successfully scrapes the source website, returns the required items, and then uses an ImagesPipeline to download (and then rename accordingly) the images from the returned image links... but only when I run it from the terminal with "runspider".

Whenever I use " crawl " from the terminal or CrawlProcess to run the spider from within the script, it returns the items but does not download the images and, I assume, completely misses the ImagePipeline. 每当我从终端或CrawlProcess使用“ 爬网 ”从脚本内运行爬虫时它都会返回项目,但不会下载图像,并且我认为完全错过了ImagePipeline。

I read that I needed to import my settings when running this way in order to properly load the pipeline, which makes sense after looking into the differences between "crawl" and "runspider", but I still cannot get the pipeline working.

There are no error messages, but I notice that it does return "[scrapy.middleware] INFO: Enabled item pipelines: []"... which I assume shows that it is still missing my pipeline?

Here's my spider.py:

import scrapy
from scrapy2.items import Scrapy2Item
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

class spider1(scrapy.Spider):
    name = "spider1"
    domain = "https://www.amazon.ca/s?k=821826022317"

    def start_requests(self):
        yield scrapy.Request(url=spider1.domain, callback=self.parse)

    def parse(self, response):
        items = Scrapy2Item()

        titlevar = response.css('span.a-text-normal ::text').extract_first()
        imgvar = [response.css('img ::attr(src)').extract_first()]
        skuvar = response.xpath('//meta[@name="keywords"]/@content')[0].extract()

        items['title'] = titlevar
        items['image_urls'] = imgvar
        items['sku'] = skuvar

        yield items

process = CrawlerProcess(get_project_settings())
process.crawl(spider1)
process.start()

Here is my items.py:

import scrapy

class Scrapy2Item(scrapy.Item):
    title = scrapy.Field()
    image_urls = scrapy.Field()
    sku = scrapy.Field()

Here is my pipelines.py:

import scrapy
from scrapy.pipelines.images import ImagesPipeline

class Scrapy2Pipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        return [scrapy.Request(x, meta={'image_name': item['sku']})
                for x in item.get('image_urls', [])]

    def file_path(self, request, response=None, info=None):
        return '%s.jpg' % request.meta['image_name']

Here is my settings.py:

BOT_NAME = 'scrapy2'

SPIDER_MODULES = ['scrapy2.spiders']
NEWSPIDER_MODULE = 'scrapy2.spiders'

ROBOTSTXT_OBEY = True

ITEM_PIPELINES = {
    'scrapy2.pipelines.Scrapy2Pipeline': 1,
}

IMAGES_STORE = 'images'

Thank you to anybody who looks at this or even attempts to help me out. It's greatly appreciated.

Since you are running your spider as a script, there is no Scrapy project environment, so get_project_settings won't work (aside from grabbing the default settings). The script must be self-contained, i.e. it must contain everything needed to run your spider (or import it from your Python search path, like any regular Python code).

I've reformatted that code for you so that it runs when you execute it with the plain Python interpreter: python3 script.py.

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import scrapy
from scrapy.pipelines.images import ImagesPipeline

BOT_NAME = 'scrapy2'
ROBOTSTXT_OBEY = True
IMAGES_STORE = 'images'


class Scrapy2Item(scrapy.Item):
    title = scrapy.Field()
    image_urls = scrapy.Field()
    sku = scrapy.Field()

class Scrapy2Pipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        return [scrapy.Request(x, meta={'image_name': item['sku']})
                for x in item.get('image_urls', [])]

    def file_path(self, request, response=None, info=None):
        return '%s.jpg' % request.meta['image_name']

class spider1(scrapy.Spider):
    name = "spider1"
    domain = "https://www.amazon.ca/s?k=821826022317"

    def start_requests(self):
        yield scrapy.Request(url=spider1.domain, callback=self.parse)

    def parse(self, response):

        items = Scrapy2Item()

        titlevar = response.css('span.a-text-normal ::text').extract_first()
        imgvar = [response.css('img ::attr(src)').extract_first()]
        skuvar = response.xpath('//meta[@name="keywords"]/@content')[0].extract()

        items['title'] = titlevar
        items['image_urls'] = imgvar
        items['sku'] = skuvar

        yield items

if __name__ == "__main__":
    from scrapy.crawler import CrawlerProcess
    from scrapy.settings import Settings

    settings = Settings(values={
        'BOT_NAME': BOT_NAME,
        'ROBOTSTXT_OBEY': ROBOTSTXT_OBEY,
        'ITEM_PIPELINES': {
            '__main__.Scrapy2Pipeline': 1,
        },
        'IMAGES_STORE': IMAGES_STORE,
        'TELNETCONSOLE_ENABLED': False,
    })

    process = CrawlerProcess(settings=settings)
    process.crawl(spider1)
    process.start()

