
Python multiprocessing - write the results in the same file

I have a simple function that writes the output of some calculations to an sqlite table. I would like to run this function in parallel using Python's multiprocessing. My specific question is: how do I avoid conflicts when each process tries to write its result to the same table? Running the code gives me this error: sqlite3.OperationalError: database is locked.

import sqlite3
from multiprocessing import Pool

conn = sqlite3.connect('test.db')
c = conn.cursor()
c.execute("CREATE TABLE table_1 (id int,output int)")

def write_to_file(a_tuple):
    index = a_tuple[0]
    input = a_tuple[1]
    output = input + 1
    c.execute('INSERT INTO table_1 (id, output) VALUES (?, ?)', (index, output))

if __name__ == "__main__":
    p = Pool()
    results = p.map(write_to_file, [(1,10),(2,11),(3,13),(4,14)])
    p.close()
    p.join()

Traceback (most recent call last):
sqlite3.OperationalError: database is locked

Using a Pool is a good idea.

I see three possible solutions to this problem.

First, instead of having the pool workers try to insert data into the database, have each worker return its data to the parent process.

In the parent process, use imap_unordered instead of map. This returns an iterable that starts yielding values as soon as they become available. The parent can then insert the data into the database.

This serializes access to the database, preventing the problem.

This solution would be preferred if the data to be inserted into the database is relatively small but updates happen very often, i.e. if it takes as much or more time to update the database as to calculate the data.
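A minimal sketch of this first approach, reusing the table and sample data from the question (Pool is used as a context manager, which assumes Python 3.3+):

import sqlite3
from multiprocessing import Pool

def calculate(a_tuple):
    # The worker only computes; it returns the row instead of touching the database.
    index, value = a_tuple
    return (index, value + 1)

if __name__ == "__main__":
    conn = sqlite3.connect('test.db')
    c = conn.cursor()
    c.execute("CREATE TABLE IF NOT EXISTS table_1 (id int, output int)")

    with Pool() as p:
        # imap_unordered yields each result as soon as any worker finishes it,
        # so the parent can insert rows while other calculations are still running.
        for row in p.imap_unordered(calculate, [(1, 10), (2, 11), (3, 13), (4, 14)]):
            c.execute("INSERT INTO table_1 (id, output) VALUES (?, ?)", row)

    conn.commit()
    conn.close()

Only the parent process ever touches test.db here, so no locking is needed.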


Second, you could use a Lock. A worker should then:

  • acquire the lock,
  • open the database,
  • insert the values,
  • close the database,
  • release the lock.

This avoids the overhead of sending the data to the parent process, but instead you may have workers stalling while they wait to write their data into the database.

This would be the preferred solution if the amount of data to be inserted is large but it takes much longer to calculate the data than to insert it into the database; see the sketch below.
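A sketch of this Lock-based variant; the lock is handed to each worker through the Pool's initializer, because a multiprocessing.Lock cannot be passed to workers as an ordinary map argument:

import sqlite3
from multiprocessing import Pool, Lock

lock = None

def init_worker(shared_lock):
    # Runs once in each worker process and stores the shared lock globally.
    global lock
    lock = shared_lock

def write_to_file(a_tuple):
    index, value = a_tuple
    output = value + 1
    with lock:  # only one worker at a time touches the database
        conn = sqlite3.connect('test.db')
        conn.execute("INSERT INTO table_1 (id, output) VALUES (?, ?)", (index, output))
        conn.commit()
        conn.close()

if __name__ == "__main__":
    setup = sqlite3.connect('test.db')
    setup.execute("CREATE TABLE IF NOT EXISTS table_1 (id int, output int)")
    setup.commit()
    setup.close()

    shared_lock = Lock()
    with Pool(initializer=init_worker, initargs=(shared_lock,)) as p:
        p.map(write_to_file, [(1, 10), (2, 11), (3, 13), (4, 14)])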


Third, you could have each worker write to its own database and merge them afterwards. You can do this directly in sqlite or even in Python, although with a large amount of data I'm not sure whether the latter has advantages. A sketch follows.
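A sketch of the per-worker-database idea; the part_<pid>.db file names and the ATTACH-based merge are illustrative choices, not something prescribed by the answer:

import os
import sqlite3
from multiprocessing import Pool

def write_to_file(a_tuple):
    index, value = a_tuple
    output = value + 1
    # Each worker process writes to its own file, so there is no contention.
    conn = sqlite3.connect('part_%d.db' % os.getpid())
    conn.execute("CREATE TABLE IF NOT EXISTS table_1 (id int, output int)")
    conn.execute("INSERT INTO table_1 (id, output) VALUES (?, ?)", (index, output))
    conn.commit()
    conn.close()
    return os.getpid()

if __name__ == "__main__":
    with Pool() as p:
        pids = set(p.map(write_to_file, [(1, 10), (2, 11), (3, 13), (4, 14)]))

    # Merge the per-worker databases into test.db with ATTACH / INSERT ... SELECT.
    main = sqlite3.connect('test.db')
    main.execute("CREATE TABLE IF NOT EXISTS table_1 (id int, output int)")
    for pid in pids:
        main.execute("ATTACH DATABASE ? AS part", ('part_%d.db' % pid,))
        main.execute("INSERT INTO table_1 SELECT id, output FROM part.table_1")
        main.commit()
        main.execute("DETACH DATABASE part")
    main.close()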

The database is locked to protect your data from corruption.

I believe you cannot have many processes accessing the same database at the same time, at least not with:

conn = sqlite3.connect('test.db')
c = conn.cursor()

If each process must access the database, you should consider closing at least the cursor object c (and, perhaps less strictly, the connection object conn) within each process, and reopening it when the process needs it again. Somehow, the other processes need to wait for the current one to release the lock before another process can acquire it. (There are many ways to achieve the waiting.) One possibility is sketched below.
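One hedged way to do that, assuming each worker opens its own short-lived connection and lets sqlite itself handle the waiting via the connect timeout (the 30-second value is an arbitrary choice):

import sqlite3

def write_to_file(a_tuple):
    index, value = a_tuple
    output = value + 1
    # Open a connection only for this insert; the timeout makes sqlite retry
    # for up to 30 seconds while another process still holds the lock.
    conn = sqlite3.connect('test.db', timeout=30)
    try:
        conn.execute("INSERT INTO table_1 (id, output) VALUES (?, ?)", (index, output))
        conn.commit()
    finally:
        conn.close()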

Setting isolation_level to 'EXCLUSIVE' fixed it for me:

conn = sqlite3.connect('test.db', isolation_level='EXCLUSIVE')
