
Django: how to wrap a bulk update/insert operation in transaction?

This is my use case:

  • I have multiple Celery tasks that run in parallel.
  • Each task may bulk-create or bulk-update many objects. For this I'm using django-bulk.

So basically I'm using a very convenient function, insert_or_update_many:

  1. it first performs a SELECT
  2. if it finds objects, it updates them
  3. otherwise it creates them
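The three steps above can be sketched in plain Python, using a dict in place of the database table (this is only an illustration of the control flow, not django-bulk's actual implementation):

```python
def insert_or_update_many(table, rows, key):
    """Sketch of the select / update / insert pattern.

    `table` maps key values to row dicts, `rows` is a list of row dicts,
    and `key` names the field used as the unique key.
    """
    # Step 1: one SELECT to find which keys already exist.
    existing = {r[key] for r in rows} & set(table)

    # Step 2: update the rows whose keys were found.
    updates = [r for r in rows if r[key] in existing]
    for r in updates:
        table[r[key]].update(r)

    # Step 3: insert the rest. If another task inserted one of these keys
    # between steps 1 and 3, a real database would raise a duplicate-entry
    # error at this point -- which is exactly the race described below.
    inserts = [r for r in rows if r[key] not in existing]
    for r in inserts:
        table[r[key]] = dict(r)

    return len(updates), len(inserts)
```

For example, starting from `table = {1: {"id": 1, "name": "old"}}`, upserting rows with ids 1 and 2 updates the first and inserts the second.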

But this introduces concurrency problems. For example: if an object does not exist during step 1, it is added to a list of objects to be inserted later. But in the meantime another Celery task may have created that object, and when my task tries to perform the bulk insert (step 3) I get a duplicate-entry error.

I guess I need to wrap the 3 steps in a 'blocking' block. I've read around about transactions and I've tried to wrap steps 1, 2 and 3 in a with transaction.commit_on_success: block:

with transaction.commit_on_success():
    cursor.execute(sql, parameters)
    existing = set(cursor.fetchall())
    if not skip_update:
        # Find the objects that need to be updated
        update_objects = [o for (o, k) in object_keys if k in existing]
        _update_many(model, update_objects, keys=keys, using=using)
    # Find the objects that need to be inserted.
    insert_objects = [o for (o, k) in object_keys if k not in existing]
    # Filter out any duplicates in the insertion
    filtered_objects = _filter_objects(con, insert_objects, key_fields)
    _insert_many(model, filtered_objects, using=using)

But this does not work for me. I'm not sure I have a full understanding of transactions. I basically need a block where I can put several operations, being sure no other process or thread is accessing (for writes) my db resources.

I basically need a block where I can put several operations being sure no other process or thread is accessing (in write) my db resources.

Django transactions will not, in general, guarantee that for you. If you're coming from other areas of computer science you naturally think of a transaction as blocking in this way, but in the database world there are different kinds of locks, at different isolation levels, and they vary for each database. So to ensure that your transactions do this you're going to have to learn about transactions, about locks and their performance characteristics, and about the mechanisms supplied by your database for controlling them.

However, having a bunch of processes all trying to lock the table in order to carry out competing inserts does not sound like a good idea. If collisions were rare you could do a form of optimistic locking and just retry the transaction if it fails. Or perhaps you can direct all of these celery tasks to a single process (there's no performance advantage to parallelizing this if you're going to acquire a table lock anyway).

My suggestion would be to start out by forgetting the bulk operations and just do one row at a time using Django's update_or_create. As long as your database has constraints that prevent duplicate entries (which it sounds like it does), this should be free of the race conditions you describe above. If the performance really does turn out to be unacceptable, then look into more complex options.
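The semantics of that per-row approach can be sketched in plain Python (in real Django code this would be `Model.objects.update_or_create(defaults=..., **lookup)` inside the task; the dict-backed `table` here is just a stand-in for the model's table):

```python
def update_or_create(table, defaults, **kwargs):
    """Plain-Python sketch of QuerySet.update_or_create semantics.

    `kwargs` are the lookup fields (assumed to form a unique key) and
    `defaults` are the fields to set. Returns (obj, created), mirroring
    the Django method.
    """
    key = tuple(sorted(kwargs.items()))
    if key in table:
        table[key].update(defaults)      # row exists: update it in place
        return table[key], False
    obj = dict(kwargs, **defaults)       # row missing: create it
    table[key] = obj
    return obj, True
```

Calling it twice with the same lookup gives `created=True` the first time and `created=False` (with the updated values) the second, which is why the duplicate-entry race disappears: each row is handled in its own small, constraint-protected operation.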

Taking the optimistic concurrency approach means that rather than preventing conflicts (by acquiring a table lock, say) you just proceed as normal and then retry the operation if there turns out to be a problem. In your case it might look something like:

while True:
    try:
        with transaction.atomic():
            do_bulk_insert_or_update()  # placeholder for your bulk insert / update operation
    except IntegrityError:
        pass
    else:
        break

So if you run into your race condition, the resulting IntegrityError will cause the transaction.atomic() block to roll back any changes that have been made, and the while loop will force a retry of the transaction (where presumably the bulk operation will now see the newly-existing row and mark it for updating rather than insertion).

This kind of approach can work really well if collisions are rare, and really badly if they are frequent.
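One consequence of frequent collisions is that the unbounded `while True` loop above can spin for a long time. A capped-retry variant of the same idea can be sketched in plain Python (the `IntegrityError` class and `run_with_retries` helper are illustrative stand-ins; in real code the operation's body would sit inside `transaction.atomic()` so each failure rolls back cleanly before the retry):

```python
class IntegrityError(Exception):
    """Stand-in for django.db.IntegrityError in this sketch."""

def run_with_retries(operation, max_attempts=3):
    """Retry `operation` until it succeeds or the attempt budget runs out."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except IntegrityError:
            if attempt == max_attempts:
                raise  # give up: collisions are too frequent for retrying
```

With a budget in place, a task that keeps colliding fails loudly instead of looping forever, which is usually the behaviour you want when the optimistic assumption turns out to be wrong.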
