
How to download scrapy images in a dynamic folder

I'm trying to override the default path full/hash.jpg with <dynamic>/hash.jpg. I've tried How to download scrapy images in a dynamic folder using the following code:

def item_completed(self, results, item, info):

    for result in [x for ok, x in results if ok]:
        path = result['path']
        # here we create the session-path where the files should be in the end
        # you'll have to change this path creation depending on your needs
        slug = slugify(item['category'])
        target_path = os.path.join(slug, os.path.basename(path))

        # try to move the file and raise exception if not possible
        if not os.rename(path, target_path):
            raise DropItem("Could not move image to target folder")

    if self.IMAGES_RESULT_FIELD in item.fields:
        item[self.IMAGES_RESULT_FIELD] = [x for ok, x in results if ok]
    return item

but I get:

Traceback (most recent call last):
  File "/home/user/.venv/sepid/lib/python2.7/site-packages/twisted/internet/defer.py", line 577, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/home/user/.venv/sepid/lib/python2.7/site-packages/twisted/internet/defer.py", line 839, in _cbDeferred
    self.callback(self.resultList)
  File "/home/user/.venv/sepid/lib/python2.7/site-packages/twisted/internet/defer.py", line 382, in callback
    self._startRunCallbacks(result)
  File "/home/user/.venv/sepid/lib/python2.7/site-packages/twisted/internet/defer.py", line 490, in _startRunCallbacks
    self._runCallbacks()
--- <exception caught here> ---
  File "/home/user/.venv/sepid/lib/python2.7/site-packages/twisted/internet/defer.py", line 577, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/home/user/Projects/sepid/scraper/scraper/pipelines.py", line 44, in item_completed
    if not os.rename(path, target_path):
exceptions.OSError: [Errno 2] No such file or directory

I don't know what's wrong. Is there any other way to change the path? Thanks

I created a pipeline inherited from ImagesPipeline, overrode the file_path method, and used it instead of the standard ImagesPipeline:

import hashlib

from scrapy.pipelines.images import ImagesPipeline
from scrapy.utils.python import to_bytes

class StoreImgPipeline(ImagesPipeline):
    def file_path(self, request, response=None, info=None):
        # YEAR is assumed to be a constant defined elsewhere in the project
        image_guid = hashlib.sha1(to_bytes(request.url)).hexdigest()
        return 'realty-sc/%s/%s/%s/%s.jpg' % (YEAR, image_guid[:2], image_guid[2:4], image_guid)
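For the custom pipeline to run, it also has to be registered in the project settings. This is a sketch; `myproject` and the storage path are placeholders for your own project name and directory:

```python
# settings.py -- enable the custom pipeline instead of the stock ImagesPipeline.
# "myproject" is a placeholder for your own project package name.
ITEM_PIPELINES = {
    'myproject.pipelines.StoreImgPipeline': 1,
}
# Root folder for downloads; the paths returned by file_path() are
# interpreted relative to this directory.
IMAGES_STORE = '/path/to/images'
```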

The problem arises because the destination folder doesn't exist. A quick solution is:

# module-level imports needed in pipelines.py:
import os

from scrapy.exceptions import DropItem
from scrapy.utils.project import get_project_settings
from slugify import slugify  # e.g. from the python-slugify package

def item_completed(self, results, item, info):

    for result in [x for ok, x in results if ok]:
        path = result['path']
        slug = slugify(item['designer'])

        settings = get_project_settings()
        storage = settings.get('IMAGES_STORE')

        target_path = os.path.join(storage, slug, os.path.basename(path))
        path = os.path.join(storage, path)

        # If the target folder doesn't exist, it will be created
        if not os.path.exists(os.path.join(storage, slug)):
            os.makedirs(os.path.join(storage, slug))

        if not os.rename(path, target_path):
            raise DropItem("Could not move image to target folder")

    if self.IMAGES_RESULT_FIELD in item.fields:
        item[self.IMAGES_RESULT_FIELD] = [x for ok, x in results if ok]
    return item

To dynamically set the path for images downloaded by a scrapy spider prior to downloading them, rather than moving them afterward, I created a custom pipeline overriding the get_media_requests and file_path methods.

class MyImagesPipeline(ImagesPipeline):

    def get_media_requests(self, item, info):
        return [
            Request(url, meta={
                'f1': item.get('field1'),
                'f2': item.get('field2'),
                'f3': item.get('field3'),
                'f4': item.get('field4'),
            })
            for url in item.get(self.images_urls_field, [])
        ]

    def file_path(self, request, response=None, info=None):
        ## start of deprecation warning block (can be removed in the future)
        def _warn():
            from scrapy.exceptions import ScrapyDeprecationWarning
            import warnings
            warnings.warn('ImagesPipeline.image_key(url) and file_key(url) methods are deprecated, '
                      'please use file_path(request, response=None, info=None) instead',
                      category=ScrapyDeprecationWarning, stacklevel=1)

        # check if called from image_key or file_key with url as first argument
        if not isinstance(request, Request):
            _warn()
            url = request
        else:
            url = request.url

        # detect if file_key() or image_key() methods have been overridden
        if not hasattr(self.file_key, '_base'):
            _warn()
            return self.file_key(url)
        elif not hasattr(self.image_key, '_base'):
            _warn()
            return self.image_key(url)
        ## end of deprecation warning block

        image_guid = hashlib.sha1(to_bytes(url)).hexdigest()
        return '%s/%s/%s/%s/%s.jpg' % (request.meta['f1'], request.meta['f2'], request.meta['f3'], request.meta['f4'], image_guid)

This approach assumes you define a scrapy.Item in your spider and replace, e.g., "field1" with your particular field name. Setting Request.meta in get_media_requests allows item field values to be used in setting download directories for each item, as shown in the return statement of file_path. Scrapy will create the directories automatically if they don't exist.
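To illustrate the path this `file_path` produces, here is the same construction in isolation (the meta values such as `'shoes'` are made-up examples, and `build_path` is a hypothetical helper, not part of the pipeline):

```python
import hashlib

def build_path(meta, url):
    # same scheme as file_path() above: <f1>/<f2>/<f3>/<f4>/<sha1>.jpg
    image_guid = hashlib.sha1(url.encode('utf-8')).hexdigest()
    return '%s/%s/%s/%s/%s.jpg' % (
        meta['f1'], meta['f2'], meta['f3'], meta['f4'], image_guid)

meta = {'f1': 'shoes', 'f2': 'summer', 'f3': 'red', 'f4': 'size-9'}
print(build_path(meta, 'http://example.com/img.jpg'))
# -> shoes/summer/red/size-9/<40-char sha1>.jpg
```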

Custom pipeline class definitions are saved in my project's pipelines.py. The methods here are adapted directly from the default scrapy pipeline images.py, which on my Mac is stored in ~/anaconda3/pkgs/scrapy-1.5.0-py36_0/lib/python3.6/site-packages/scrapy/pipelines/. Imports and additional methods can be copied from that file as needed.

The solution @neelix gave is the best one, but when I tried to use it I got some strange results: some documents were moved, but not all of them. So I replaced:

if not os.rename(path, target_path):
            raise DropItem("Could not move image to target folder")

and imported the shutil library. My code is now:

# module-level imports needed in pipelines.py:
import os
import shutil

from scrapy.utils.project import get_project_settings
from slugify import slugify  # e.g. from the python-slugify package

def item_completed(self, results, item, info):

    for result in [x for ok, x in results if ok]:
        path = result['path']
        slug = slugify(item['designer'])

        settings = get_project_settings()
        storage = settings.get('IMAGES_STORE')

        target_path = os.path.join(storage, slug, os.path.basename(path))
        path = os.path.join(storage, path)

        # If the target folder doesn't exist, it will be created
        if not os.path.exists(os.path.join(storage, slug)):
            os.makedirs(os.path.join(storage, slug))

        shutil.move(path, target_path)

    if self.IMAGES_RESULT_FIELD in item.fields:
        item[self.IMAGES_RESULT_FIELD] = [x for ok, x in results if ok]
    return item

I hope it works for you guys too :)
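A likely explanation for the "some documents moved but not all" behaviour: `os.rename` returns `None` on success, so `if not os.rename(...)` is always true, and `DropItem` is raised right after the first image is moved, aborting the loop. A small sketch of the pitfall:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as d:
    src = os.path.join(d, 'a.jpg')
    open(src, 'w').close()
    dst = os.path.join(d, 'b.jpg')

    result = os.rename(src, dst)   # moves the file, returns None
    assert os.path.exists(dst)     # the move itself succeeded...
    assert not result              # ...but `if not result` still fires,
                                   # which is why DropItem was raised anyway
```

With `shutil.move` the return value is simply not tested, so the loop runs to completion for every image.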
