Should Django model object instances be passed to Celery?
# models.py
from django.db import models

class Person(models.Model):
    first_name = models.CharField(max_length=30)
    last_name = models.CharField(max_length=30)
    text_blob = models.CharField(max_length=50000)
# tasks.py
import celery

@celery.task
def my_task(person):
    # example operation: does something to person;
    # needs only a few of the attributes of person,
    # not the entire bulky record
    person.first_name = person.first_name.title()
    person.last_name = person.last_name.title()
    person.save()
In my application somewhere I have something like:
from models import Person
from tasks import my_task
import celery
g = celery.group([my_task.s(p) for p in Person.objects.all()])
g.apply_async()
How can I efficiently and evenly distribute the Person records to workers running on multiple machines?
Could this be a better idea? Wouldn't it overwhelm the DB if Person has a few million records?
# tasks.py
import celery
from models import Person

@celery.task
def my_task(person_pk):
    # example operation that does not need text_blob
    person = Person.objects.get(pk=person_pk)
    person.first_name = person.first_name.title()
    person.last_name = person.last_name.title()
    person.save()

# In my application somewhere
from models import Person
from tasks import my_task
import celery

g = celery.group([my_task.s(p.pk) for p in Person.objects.all()])
g.apply_async()
I believe it is better and safer to pass the PK rather than the whole model object. Since a PK is just a number, serialization is also much simpler. Most importantly, you can use a safer serializer (JSON/YAML instead of pickle) and have peace of mind that you won't have any problems serializing your model.
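To make that concrete, here is a minimal sketch of the Celery settings that enforce the safer serializer. The setting names are the classic `CELERY_`-prefixed ones matching the Celery 3.x-era API used in the question's code; treat them as an illustration rather than a drop-in config:

```python
# settings.py (or wherever your Celery config lives)
# Send task arguments (e.g. the PK) as JSON instead of pickle:
CELERY_TASK_SERIALIZER = "json"
# Store results as JSON as well:
CELERY_RESULT_SERIALIZER = "json"
# Refuse to deserialize anything other than JSON (protects workers
# from pickled payloads):
CELERY_ACCEPT_CONTENT = ["json"]
```

With pickle, a malicious or corrupted message can execute arbitrary code on the worker; restricting accepted content to JSON closes that hole, and PKs serialize to JSON trivially.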
Since Celery is a distributed system, you can't know in which process, or even on what machine, the task will run. So you shouldn't pass Django model objects as arguments to tasks; it's almost always better to re-fetch the object from the database instead, as there are possible race conditions involved.
Yes. If there are millions of records in the database then this probably isn't the best approach, but since you have to go through all those millions of records, then pretty much no matter what you do, your DB is going to get hit pretty hard.
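One way to soften both the broker load and the DB load is to send one task per batch of PKs rather than one task per row. This is a hedged sketch, not the answer's own code: the `chunked()` helper is plain Python, while the commented dispatch below it assumes the `Person` model from the question and a hypothetical batch-aware variant of `my_task` that accepts a list of PKs:

```python
from itertools import islice

def chunked(iterable, size):
    """Lazily yield lists of at most `size` items from any iterable,
    so millions of PKs never need to be held in memory at once."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch

# Illustrative dispatch (assumes Person and a batch-aware my_task):
# pks = Person.objects.values_list("pk", flat=True).iterator()
# g = celery.group(my_task.s(batch) for batch in chunked(pks, 1000))
# g.apply_async()
```

Fetching only the PKs with `values_list("pk", flat=True).iterator()` avoids pulling the bulky `text_blob` column out of the database at dispatch time, and batching by 1000 turns millions of queue messages into thousands.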
Here are some alternatives, none of which I'd call "better", just different.
One alternative is a custom Django management command that takes a letter as an argument, so that running

./manage.py my_task a

would update all the records where the last name starts with "a". Obviously you'd have to run this several times to get through the whole database. Keep in mind that the .save() is going to be a harder "hit" on the database than actually selecting the millions of records.
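The management command itself might look something like the following sketch. It reuses `Person` and `my_task` from the question, but the file layout and the `add_arguments` handling are assumptions (that API belongs to newer Django versions; older ones used optparse-style options):

```python
# management/commands/my_task.py (hypothetical location)
from django.core.management.base import BaseCommand

from models import Person
from tasks import my_task

class Command(BaseCommand):
    help = "Queue my_task for every Person whose last name starts with a letter"

    def add_arguments(self, parser):
        parser.add_argument("letter", type=str)

    def handle(self, *args, **options):
        letter = options["letter"]
        # Fetch only the PKs; the task re-fetches whatever it needs.
        pks = (Person.objects
               .filter(last_name__istartswith=letter)
               .values_list("pk", flat=True)
               .iterator())
        for pk in pks:
            my_task.delay(pk)
```

Each invocation touches only one letter's worth of rows, so the selects stay cheap; the expensive part remains the per-row `.save()` inside the task, as noted above.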