
Should django model object instances be passed to celery?

# models.py
from django.db import models

class Person(models.Model):
    first_name = models.CharField(max_length=30)
    last_name = models.CharField(max_length=30)
    text_blob = models.CharField(max_length=50000)

# tasks.py
import celery
@celery.task
def my_task(person):
    # example operation: does something to person 
    # needs only a few of the attributes of person
    # and not the entire bulky record
    person.first_name = person.first_name.title()
    person.last_name = person.last_name.title()
    person.save()

In my application somewhere I have something like:

from models import Person
from tasks import my_task
import celery
g = celery.group([my_task.s(p) for p in Person.objects.all()])
g.apply_async()
  • Celery pickles p to send it to the worker, right?
  • If the workers are running on multiple machines, would the entire Person object (along with the bulky text_blob, which is mostly not needed) be transmitted over the network? Is there a way to avoid it?
  • How can I efficiently and evenly distribute the Person records to workers running on multiple machines?

  • Could this be a better idea? Wouldn't it overwhelm the db if Person has a few million records?

     # tasks.py
     import celery
     from models import Person

     @celery.task
     def my_task(person_pk):
         # example operation that does not need text_blob
         person = Person.objects.get(pk=person_pk)
         person.first_name = person.first_name.title()
         person.last_name = person.last_name.title()
         person.save()

     # In my application somewhere
     from models import Person
     from tasks import my_task
     import celery

     g = celery.group([my_task.s(p.pk) for p in Person.objects.all()])
     g.apply_async()

I believe it is better and safer to pass the PK rather than the whole model object. Since the PK is just a number, serialization is also much simpler. Most importantly, you can use a safer serializer (JSON/YAML instead of pickle) and have peace of mind that you won't have any problems serializing your model.

As this article says:

Since Celery is a distributed system, you can't know in which process, or even on what machine, the task will run. So you shouldn't pass Django model objects as arguments to tasks; it's almost always better to re-fetch the object from the database instead, as there are possible race conditions involved.

Yes. If there are millions of records in the database then this probably isn't the best approach, but since you have to go through many millions of records anyway, then pretty much no matter what you do, your DB is going to get hit pretty hard.
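On that note, one way to keep both the broker and the DB load manageable is to send primary keys in batches rather than queueing one task per row. A minimal sketch, assuming a hypothetical `chunked` helper (the name and the batch size are my own; Celery also ships `my_task.chunks(...)` for the same pattern):

```python
# Sketch: instead of one Celery task per Person (millions of tiny messages),
# send one task per chunk of primary keys. Each task would then run
# Person.objects.filter(pk__in=batch) and process the batch in one query.
from itertools import islice

def chunked(iterable, size):
    """Yield lists of up to `size` items from any iterable (e.g. a PK queryset)."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch

# With Django, the PK source would be
# Person.objects.values_list("pk", flat=True).iterator(),
# which never loads text_blob at all. Plain integers stand in for it here:
pks = range(1, 11)
print(list(chunked(pks, 4)))  # [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10]]
```

Because only integers cross the network, the bulky text_blob never leaves the database server, and JSON serialization of the task arguments is trivial.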

Here are some alternatives, none of which I'd call "better", just different.

  1. Implement a pre_save signal handler for your Person class that does the .title() stuff. That way your first_name/last_name will always get stored correctly in the db, and you'll never have to do this again.
  2. Use a management command that takes some kind of paging parameter... perhaps use the first letter of the last name to segment the Persons. So running ./manage.py my_task a would update all the records where the last name starts with "a". Obviously you'd have to run this several times to get through the whole database.
  3. Maybe you can do it with some creative sql. I'm not even going to attempt that here, but it might be worth investigating.

Keep in mind that the .save() is going to be a harder "hit" on the database than actually selecting the millions of records.
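That is exactly where alternative 3 ("creative sql") shines: a single UPDATE statement does the whole table inside the database, with no rows pulled into Python and no per-row .save(). A sketch, demonstrated on SQLite via the stdlib so it is runnable here; on PostgreSQL you could simply write INITCAP(first_name), and the table/column names match the question's model:

```python
# One set-based UPDATE instead of millions of SELECT + save() round trips.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (first_name TEXT, last_name TEXT)")
conn.executemany("INSERT INTO person VALUES (?, ?)",
                 [("john", "smith"), ("jane", "doe")])

# Capitalise the first letter directly in SQL (SQLite has no INITCAP,
# so splice UPPER() of the first character onto the rest of the string).
conn.execute("""
    UPDATE person
    SET first_name = UPPER(SUBSTR(first_name, 1, 1)) || SUBSTR(first_name, 2),
        last_name  = UPPER(SUBSTR(last_name,  1, 1)) || SUBSTR(last_name,  2)
""")

rows = conn.execute(
    "SELECT first_name, last_name FROM person ORDER BY last_name").fetchall()
print(rows)  # [('Jane', 'Doe'), ('John', 'Smith')]
```

From Django you could issue the same statement through connection.cursor(), skipping the ORM (and therefore any signals) entirely.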
