Celery task in Flask for uploading and resizing images and storing it to Amazon S3

I'm trying to create a celery task for uploading and resizing an image before storing it to Amazon S3. But it doesn't work as expected. Without the task everything is working fine. This is the code so far:

stacktrace

Traceback (most recent call last):
  File "../myVE/lib/python2.7/site-packages/kombu/messaging.py", line 579, in _receive_callback
    decoded = None if on_m else message.decode()
  File "../myVE/lib/python2.7/site-packages/kombu/transport/base.py", line 147, in decode
    self.content_encoding, accept=self.accept)
  File "../myVE/lib/python2.7/site-packages/kombu/serialization.py", line 187, in decode
    return decode(data)
  File "../myVE/lib/python2.7/site-packages/kombu/serialization.py", line 74, in pickle_loads
    return load(BytesIO(s))
  File "../myVE/lib/python2.7/site-packages/werkzeug/datastructures.py", line 2595, in __getattr__
    return getattr(self.stream, name)
  File "../myVE/lib/python2.7/site-packages/werkzeug/datastructures.py", line 2595, in __getattr__
    return getattr(self.stream, name)
    ...
RuntimeError: maximum recursion depth exceeded while calling a Python object

views.py

from PIL import Image

from flask import Blueprint, redirect, render_template, request, url_for

from myapplication.forms import UploadForm
from myapplication.tasks import upload_task


main = Blueprint('main', __name__)

@main.route('/upload', methods=['GET', 'POST'])
def upload():
    form = UploadForm()
    if form.validate_on_submit():
        upload_task.delay(form.title.data, form.description.data,
                          Image.open(request.files['image']))
        return redirect(url_for('main.index'))
    return render_template('upload.html', form=form)

tasks.py

from StringIO import StringIO

from flask import current_app

from myapplication.extensions import celery, db
from myapplication.helpers import resize, s3_upload
from myapplication.models import MyObject


@celery.task(name='tasks.upload_task')
def upload_task(title, description, source):
    stream = StringIO()
    target = resize(source, current_app.config['SIZE'])
    target.save(stream, 'JPEG', quality=95)
    stream.seek(0)
    obj = MyObject(title=title, description=description, url=s3_upload(stream))
    db.session.add(obj)
    db.session.commit()

I know this is a very old question, but I struggled with passing the file's contents to the celery task. I kept getting errors when trying to follow what others had done. So I wrote this up, hoping it may help others in the future.

TL;DR

  • Send the file contents to the celery task with base64 encoding
  • Decode the data in the celery task and use io.BytesIO for the stream

Long answer

I was not interested in saving the image to disk and reading it again, so I wanted to pass the needed data to reconstruct the file in the background.

Trying to follow what others suggested, I kept getting encoding errors. Some of the errors were:

  • UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
  • TypeError: initial_value must be str or None, not bytes

The TypeError was thrown by io.StringIO. Trying to decode the data to get rid of the UnicodeDecodeError did not make much sense. Since the data is binary in the first place, I tried an io.BytesIO instance instead, and that worked perfectly. The only thing I needed to do was encode the file's stream with base64, and then I was able to pass the content to the celery task.

Code samples

images.py

import base64

# file_ is assumed to be a werkzeug FileStorage, e.g. request.files['image']
file_.stream.seek(0)  # start from beginning of file
# some of the data may not be defined
data = {
  'stream': base64.b64encode(file_.read()),
  'name': file_.name,
  'filename': file_.filename,
  'content_type': file_.content_type,
  'content_length': file_.content_length,
  'headers': {header[0]: header[1] for header in file_.headers}
}

###
# add logic to sanitize required fields
###

# define the params for the upload (here I am using AWS S3)
bucket, s3_image_path = AWS_S3_BUCKET, AWS_S3_IMAGE_PATH
# import and call the background task
from async_tasks import upload_async_photo 
upload_async_photo.delay(
  data=data,
  image_path=s3_image_path,
  bucket=bucket)

async_tasks

import base64, io
from werkzeug.datastructures import FileStorage

@celery.task
def upload_async_photo(data, image_path, bucket):
    bucket = get_s3_bucket(bucket) # get bucket instance
    try:
        # decode the stream
        data['stream'] = base64.b64decode(data['stream'])
        # create a BytesIO instance
        # https://docs.python.org/3/library/io.html#binary-i-o
        data['stream'] = io.BytesIO(data['stream'])
        # create the file structure
        file_ = FileStorage(**data)
        # upload image
        bucket.put_object(
                Body=file_,
                Key=image_path,
                ContentType=data['content_type'])
    except Exception as e:
        print(str(e))

Edit

I also changed what content celery accepts and how it serializes data. To avoid trouble passing the bytes instance to the celery task, I had to add the following to my config:

CELERY_ACCEPT_CONTENT = ['pickle']
CELERY_TASK_SERIALIZER = 'pickle'
CELERY_RESULT_SERIALIZER = 'pickle'
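
As a side note, newer Celery versions (4.x and later) use lowercase setting names; a minimal sketch of the equivalent configuration, assuming the celery app object from the examples above:

# Celery 4+ lowercase equivalents of the settings above
celery.conf.update(
    accept_content=['pickle'],   # allow pickled message bodies
    task_serializer='pickle',
    result_serializer='pickle',
)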

It looks like you are attempting to pass the entire uploaded file as part of the Celery message. I imagine that is causing you some trouble. I would recommend seeing if you can save the file to the web server as part of the view, then have the message (the "delay" argument) contain the filename rather than the entire file's data. The task can then read the file from the hard drive, upload it to S3, and then delete it locally.
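
For illustration, a minimal sketch of that idea, reusing the imports and helpers from the question's views.py and tasks.py; the UPLOAD_DIR config key is an assumption, not something from the original post:

import os

from werkzeug.utils import secure_filename

@main.route('/upload', methods=['GET', 'POST'])
def upload():
    form = UploadForm()
    if form.validate_on_submit():
        file_ = request.files['image']
        # save to local disk; only the path travels in the Celery message
        path = os.path.join(current_app.config['UPLOAD_DIR'],
                            secure_filename(file_.filename))
        file_.save(path)
        upload_task.delay(form.title.data, form.description.data, path)
        return redirect(url_for('main.index'))
    return render_template('upload.html', form=form)

@celery.task(name='tasks.upload_task')
def upload_task(title, description, path):
    stream = StringIO()
    target = resize(Image.open(path), current_app.config['SIZE'])
    target.save(stream, 'JPEG', quality=95)
    stream.seek(0)
    obj = MyObject(title=title, description=description, url=s3_upload(stream))
    db.session.add(obj)
    db.session.commit()
    os.remove(path)  # clean up the local copy once it is safely on S3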

I understand this is a very old post, but just in case it helps someone: the best way forward in cases like this is to download the image from an external source and then do the async operation.

I was able to get a similar async flow working after fixing the serialization issue as suggested by @Obeyed (I didn't need to change the celery config, though), but I eventually moved away from that solution because file contents can be very large and consume a lot of resources in the message broker.

@Mark Hildreth's approach is not very helpful if you want to delegate the async task to a worker machine.

Perhaps a better approach in this case would have been to upload the original image synchronously and then asynchronously download, resize, and re-upload the image to replace the original one.
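
A rough sketch of that idea, assuming boto3, a celery app like the question's, and a resize helper like the one in the question (the bucket and key are whatever the synchronous upload used):

import io

import boto3
from PIL import Image

s3 = boto3.client('s3')

@celery.task
def resize_in_place(bucket, key, size):
    # download the original that was uploaded synchronously
    original = io.BytesIO()
    s3.download_fileobj(bucket, key, original)
    original.seek(0)
    # resize and re-upload under the same key, replacing the original
    target = resize(Image.open(original), size)
    out = io.BytesIO()
    target.save(out, 'JPEG', quality=95)
    out.seek(0)
    s3.upload_fileobj(out, bucket, key)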

Old question, but I just had the same problem. The accepted answer did not work for me (I'm using Docker instances, so Celery does not have access to the producer's filesystem; it is also slow to first save the file to the local filesystem).

My solution keeps the file in RAM, so it is much faster. The only downside is that if you need to handle large files (>1 GB), you need a server with a lot of RAM.

The doc_file is of type werkzeug.datastructures.FileStorage (see the docs here).

Sending the file to the celery worker:

entry.delay(doc_file.read(), doc_file.filename, doc_file.name,
            doc_file.content_length, doc_file.content_type, doc_file.headers)

Receiving the file:

from werkzeug.datastructures import FileStorage
from StringIO import StringIO  # Python 2; on Python 3 use io.BytesIO, since the stream is binary

@celery.task()
def entry(stream, filename, name, content_length, content_type, headers):
    # rebuild the FileStorage object from the raw contents kept in RAM
    doc = FileStorage(stream=StringIO(stream), filename=filename, name=name,
                      content_type=content_type, content_length=content_length)
    # Do something with the file (e.g. save to Amazon S3)

Passing files to a celery task is a problem because of serialization. A file could be converted into bytes, but that takes more memory than desired. Sharing a file_path makes sense, but if you are using docker images, the containers do not share the same paths.

So, my solution is to create a shared volume between my api (web) container and the celery container so that they can share files.

services:
  api:
    build: .
    ports:
      - 8000:8000
    volumes:
      - ./:/usr/src/app

  celery_worker:
    container_name: celery_worker
    build: .
    command: celery -A celery_worker.celery worker --loglevel=info
    volumes:
      - .:/usr/src/app

My celery task is:

@celery.task
def test(file_path):
    print('Current directory in celery container is: ' + os.getcwd())
    images_dir = os.getcwd()
    final_dir = os.path.join(images_dir, file_path)
    print(f'Filepath in celery task is: {final_dir}')
    with open(final_dir, "rb") as f:
        s3_client.upload_fileobj(f, bucket_name, file_path)

My code to call the task in the FastAPI endpoint is:

with open(file_path, "wb+") as file_object:
    file_object.write(uploaded_file.file.read())
test.delay(file_path)
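
For context, a minimal sketch of the surrounding FastAPI endpoint (the route, the UploadFile parameter, and the uploads directory are my assumptions, not code from the original answer; it assumes the test task above is importable):

import os

from fastapi import FastAPI, File, UploadFile

app = FastAPI()
UPLOAD_DIR = 'uploads'  # assumed directory inside the shared volume

@app.post('/images')
async def upload_image(uploaded_file: UploadFile = File(...)):
    # write into the shared volume so the celery container can see the file
    file_path = os.path.join(UPLOAD_DIR, uploaded_file.filename)
    with open(file_path, "wb+") as file_object:
        file_object.write(uploaded_file.file.read())
    test.delay(file_path)
    return {'status': 'queued', 'path': file_path}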

So, the best solution I found so far is to save the file in a shared volume between the containers, pass the file path to the celery worker, upload the file to the s3 bucket, and then remove the file from the volume.
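
The removal step is not shown in the task above; a sketch of the same task with that cleanup added (the os.remove call is my assumption of where it belongs):

import os

@celery.task
def test(file_path):
    final_dir = os.path.join(os.getcwd(), file_path)
    with open(final_dir, "rb") as f:
        s3_client.upload_fileobj(f, bucket_name, file_path)
    os.remove(final_dir)  # free the shared volume after a successful upload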

Hope it helps.
