How to download scrapy images in a dynamic folder based on
I'm trying to override the default path full/hash.jpg with <dynamic>/hash.jpg. I tried the approach from "How to download scrapy images in a dynamic folder", using the following code:
    def item_completed(self, results, item, info):
        for result in [x for ok, x in results if ok]:
            path = result['path']
            # here we create the session-path where the files should be in the end
            # you'll have to change this path creation depending on your needs
            slug = slugify(item['category'])
            target_path = os.path.join(slug, os.path.basename(path))

            # try to move the file and raise exception if not possible
            if not os.rename(path, target_path):
                raise DropItem("Could not move image to target folder")

        if self.IMAGES_RESULT_FIELD in item.fields:
            item[self.IMAGES_RESULT_FIELD] = [x for ok, x in results if ok]
        return item
But I get:
    Traceback (most recent call last):
      File "/home/user/.venv/sepid/lib/python2.7/site-packages/twisted/internet/defer.py", line 577, in _runCallbacks
        current.result = callback(current.result, *args, **kw)
      File "/home/user/.venv/sepid/lib/python2.7/site-packages/twisted/internet/defer.py", line 839, in _cbDeferred
        self.callback(self.resultList)
      File "/home/user/.venv/sepid/lib/python2.7/site-packages/twisted/internet/defer.py", line 382, in callback
        self._startRunCallbacks(result)
      File "/home/user/.venv/sepid/lib/python2.7/site-packages/twisted/internet/defer.py", line 490, in _startRunCallbacks
        self._runCallbacks()
    --- <exception caught here> ---
      File "/home/user/.venv/sepid/lib/python2.7/site-packages/twisted/internet/defer.py", line 577, in _runCallbacks
        current.result = callback(current.result, *args, **kw)
      File "/home/user/Projects/sepid/scraper/scraper/pipelines.py", line 44, in item_completed
        if not os.rename(path, target_path):
    exceptions.OSError: [Errno 2] No such file or directory
I don't know what's wrong. Is there any other way to change the path? Thanks.
I created a pipeline that inherits from ImagesPipeline, overrode its file_path method, and used it in place of the standard ImagesPipeline:
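    import hashlib
    from scrapy.pipelines.images import ImagesPipeline
    from scrapy.utils.python import to_bytes

    class StoreImgPipeline(ImagesPipeline):
        def file_path(self, request, response=None, info=None):
            # YEAR is not defined in this snippet; it must be set elsewhere in the project
            image_guid = hashlib.sha1(to_bytes(request.url)).hexdigest()
            return 'realty-sc/%s/%s/%s/%s.jpg' % (YEAR, image_guid[:2], image_guid[2:4], image_guid)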
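To make the directory layout produced by that return statement concrete, here is a small standalone sketch of the same idea outside Scrapy. The function name `sharded_image_path` and its `prefix` and `year` parameters are illustrative assumptions, not part of Scrapy or of the answer; only the SHA-1 naming and the two-level prefix split come from the code above.

```python
import hashlib


def sharded_image_path(url, prefix, year):
    """Build a relative path like '<prefix>/<year>/ab/cd/<sha1>.jpg'.

    Hypothetical helper: it mirrors the string formatting used in the
    file_path override, but takes the prefix and year as arguments
    instead of hard-coding them.
    """
    # SHA-1 of the URL gives a stable 40-character hex name for the image.
    image_guid = hashlib.sha1(url.encode("utf-8")).hexdigest()
    # The first two pairs of hex digits become nested directories, which
    # keeps any single directory from accumulating too many files.
    return "%s/%s/%s/%s/%s.jpg" % (
        prefix, year, image_guid[:2], image_guid[2:4], image_guid
    )
```

Calling `sharded_image_path("http://example.com/a.jpg", "realty-sc", 2018)` yields a five-part path whose third and fourth parts are the first four hex digits of the file name.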
The problem arises because the destination folder doesn't exist. A quick solution is to create it first and to resolve both paths against IMAGES_STORE:
    def item_completed(self, results, item, info):
        for result in [x for ok, x in results if ok]:
            path = result['path']
            slug = slugify(item['designer'])

            settings = get_project_settings()
            storage = settings.get('IMAGES_STORE')

            target_path = os.path.join(storage, slug, os.path.basename(path))
            path = os.path.join(storage, path)

            # If path doesn't exist, it will be created
            if not os.path.exists(os.path.join(storage, slug)):
                os.makedirs(os.path.join(storage, slug))

            # Note: os.rename() returns None on success and raises OSError on
            # failure, so this check raises DropItem even after a successful move.
            if not os.rename(path, target_path):
                raise DropItem("Could not move image to target folder")

        if self.IMAGES_RESULT_FIELD in item.fields:
            item[self.IMAGES_RESULT_FIELD] = [x for ok, x in results if ok]
        return item
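The two failure modes discussed here can be reproduced outside Scrapy. The sketch below is only an illustration: the helper name `demo_rename` and the temporary paths are made up for the example. It shows that os.rename raises an OSError when the target directory is missing, and that it returns None on success, so a check of the form `if not os.rename(...)` is true even when the move worked.

```python
import os
import tempfile


def demo_rename():
    """Return (error_without_dir, rename_return_value, target_exists)."""
    root = tempfile.mkdtemp()
    src = os.path.join(root, "full", "hash.jpg")
    os.makedirs(os.path.dirname(src))
    with open(src, "wb") as handle:
        handle.write(b"fake image bytes")

    target = os.path.join(root, "some-category", "hash.jpg")

    # 1) The target directory does not exist yet, so the rename fails.
    try:
        os.rename(src, target)
        error = None
    except OSError as exc:
        error = exc

    # 2) After creating the directory, the rename succeeds and returns None.
    os.makedirs(os.path.dirname(target))
    result = os.rename(src, target)
    return error, result, os.path.exists(target)
```

If this matches what happens inside the pipeline, the DropItem would be raised right after the first image of an item has been moved, which may explain reports of only some files ending up in the target folder.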
To set the path for images downloaded by a Scrapy spider before they are downloaded, rather than moving them afterward, I created a custom pipeline that overrides the get_media_requests and file_path methods.
    class MyImagesPipeline(ImagesPipeline):

        def get_media_requests(self, item, info):
            return [Request(url, meta={'f1': item.get('field1'), 'f2': item.get('field2'),
                                       'f3': item.get('field3'), 'f4': item.get('field4')})
                    for url in item.get(self.images_urls_field, [])]

        def file_path(self, request, response=None, info=None):
            ## start of deprecation warning block (can be removed in the future)
            def _warn():
                from scrapy.exceptions import ScrapyDeprecationWarning
                import warnings
                warnings.warn('ImagesPipeline.image_key(url) and file_key(url) methods are deprecated, '
                              'please use file_path(request, response=None, info=None) instead',
                              category=ScrapyDeprecationWarning, stacklevel=1)

            # check if called from image_key or file_key with url as first argument
            if not isinstance(request, Request):
                _warn()
                url = request
            else:
                url = request.url

            # detect if file_key() or image_key() methods have been overridden
            if not hasattr(self.file_key, '_base'):
                _warn()
                return self.file_key(url)
            elif not hasattr(self.image_key, '_base'):
                _warn()
                return self.image_key(url)
            ## end of deprecation warning block

            image_guid = hashlib.sha1(to_bytes(url)).hexdigest()
            return '%s/%s/%s/%s/%s.jpg' % (request.meta['f1'], request.meta['f2'],
                                           request.meta['f3'], request.meta['f4'], image_guid)
This approach assumes you define a scrapy.Item in your spider and replace, e.g., "field1" with your particular field name. Setting Request.meta in get_media_requests allows item field values to be used in setting the download directory for each item, as shown in the return statement of file_path. Scrapy will create the directories automatically if they don't exist.
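Because the item fields end up as directory names, values that are missing or that contain path separators can produce surprising paths. The following sketch is one hedged way to build the relative path from the meta values before returning it from file_path. `safe_segment` and `path_from_meta` are hypothetical helpers, and the fallback name "unknown" is an assumption for the example, not Scrapy behavior.

```python
import hashlib
import re


def safe_segment(value, fallback="unknown"):
    """Turn an arbitrary field value into a single safe directory name."""
    text = "" if value is None else str(value).strip()
    # Replace anything other than ASCII letters, digits, dot, dash or
    # underscore, so a value such as 'a/b' cannot add directory levels.
    text = re.sub(r"[^A-Za-z0-9._-]+", "-", text).strip("-.")
    return text or fallback


def path_from_meta(meta, url, keys=("f1", "f2", "f3", "f4")):
    """Build '<f1>/<f2>/<f3>/<f4>/<sha1>.jpg' from request-meta-style values."""
    segments = [safe_segment(meta.get(key)) for key in keys]
    image_guid = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return "/".join(segments + [image_guid + ".jpg"])
```

Note that this sketch also replaces non-ASCII characters, which may be too strict for some category names; the pattern would need adjusting in that case.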
Custom pipeline class definitions are saved in my project's pipelines.py. The methods here are adapted directly from the default Scrapy pipeline images.py, which on my Mac is stored in ~/anaconda3/pkgs/scrapy-1.5.0-py36_0/lib/python3.6/site-packages/scrapy/pipelines/. Imports and additional methods can be copied from that file as needed.
The solution that @neelix gave is the best one, but when I tried it I got some strange results: some files were moved, but not all of them. So I replaced:
    if not os.rename(path, target_path):
        raise DropItem("Could not move image to target folder")
with a call to shutil.move(path, target_path), after importing the shutil library. My code is now:
    def item_completed(self, results, item, info):
        for result in [x for ok, x in results if ok]:
            path = result['path']
            slug = slugify(item['designer'])

            settings = get_project_settings()
            storage = settings.get('IMAGES_STORE')

            target_path = os.path.join(storage, slug, os.path.basename(path))
            path = os.path.join(storage, path)

            # If path doesn't exist, it will be created
            if not os.path.exists(os.path.join(storage, slug)):
                os.makedirs(os.path.join(storage, slug))
            shutil.move(path, target_path)

        if self.IMAGES_RESULT_FIELD in item.fields:
            item[self.IMAGES_RESULT_FIELD] = [x for ok, x in results if ok]
        return item
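One detail the snippets in this thread leave open is that the item still records the original path from the results after the file has been moved. The sketch below is a standalone illustration, assuming Python 3, a hypothetical helper `relocate_image`, and a plain storage directory rather than Scrapy's settings. It creates the target folder if needed, moves the file with shutil.move, and returns the new path relative to the storage root so the caller could store it on the item.

```python
import os
import shutil
import tempfile


def relocate_image(storage, relative_path, folder):
    """Move storage/relative_path into storage/folder; return the new relative path."""
    target_dir = os.path.join(storage, folder)
    # exist_ok avoids a race between checking for the folder and creating it.
    os.makedirs(target_dir, exist_ok=True)
    new_relative = os.path.join(folder, os.path.basename(relative_path))
    # shutil.move raises on failure, so no return-value check is needed.
    shutil.move(os.path.join(storage, relative_path),
                os.path.join(storage, new_relative))
    return new_relative


def demo():
    """Create a fake download under 'full/' and relocate it."""
    storage = tempfile.mkdtemp()
    os.makedirs(os.path.join(storage, "full"))
    with open(os.path.join(storage, "full", "hash.jpg"), "wb") as handle:
        handle.write(b"fake image bytes")
    new_path = relocate_image(storage, os.path.join("full", "hash.jpg"), "designer-x")
    return storage, new_path
```

Inside item_completed, the returned value could be written back to result['path'] before the results are assigned to the item; whether that is desirable depends on how the stored paths are used downstream.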
I hope it works for you too :)