
Python sqlite3 and concurrency

I have a Python program that uses the "threading" module. Once every second, my program starts a new thread that fetches some data from the web and stores this data to my hard drive. I would like to use sqlite3 to store these results, but I can't get it to work. The issue seems to be about the following line:

conn = sqlite3.connect("mydatabase.db")
  • If I put this line of code inside each thread, I get an OperationalError telling me that the database file is locked. I guess this means that another thread has mydatabase.db open through a sqlite3 connection and has locked it.
  • If I put this line of code in the main program and pass the connection object (conn) to each thread, I get a ProgrammingError, saying that SQLite objects created in a thread can only be used in that same thread.

Previously I was storing all my results in CSV files and did not have any of these file-locking issues. Hopefully this will be possible with sqlite. Any ideas?

Contrary to popular belief, newer versions of sqlite3 do support access from multiple threads.

This can be enabled via the optional keyword argument check_same_thread:

sqlite3.connect(":memory:", check_same_thread=False)
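
A minimal sketch of sharing one connection this way (the results table is illustrative). Note that check_same_thread=False only disables sqlite3's same-thread check; it does not serialize your transactions, so access is guarded with a lock:

import sqlite3
import threading

conn = sqlite3.connect("mydatabase.db", check_same_thread=False)
conn.execute("CREATE TABLE IF NOT EXISTS results (value INTEGER)")
db_lock = threading.Lock()

def store(value):
    with db_lock:  # one writer at a time avoids "database is locked" errors
        conn.execute("INSERT INTO results (value) VALUES (?)", (value,))
        conn.commit()

threads = [threading.Thread(target=store, args=(i,)) for i in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()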

You can use the producer-consumer pattern. For example, you can create a queue that is shared between threads. The first thread, which fetches data from the web, enqueues this data in the shared queue. Another thread, which owns the database connection, dequeues data from the queue and passes it to the database.
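
A minimal sketch of that pattern (the fetch is stubbed out and the results table is illustrative); the sqlite3 connection is created and used only inside the writer thread:

import queue
import sqlite3
import threading

q = queue.Queue()

def writer():
    conn = sqlite3.connect("mydatabase.db")  # lives in this thread only
    conn.execute("CREATE TABLE IF NOT EXISTS results (value TEXT)")
    while True:
        item = q.get()
        if item is None:  # sentinel: no more work
            break
        conn.execute("INSERT INTO results (value) VALUES (?)", (item,))
        conn.commit()
    conn.close()

def fetcher(n):
    q.put(f"payload-{n}")  # stand-in for the real web fetch

w = threading.Thread(target=writer)
w.start()
fetchers = [threading.Thread(target=fetcher, args=(i,)) for i in range(5)]
for t in fetchers:
    t.start()
for t in fetchers:
    t.join()
q.put(None)  # tell the writer to finish
w.join()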

Switch to multiprocessing. It is much better, scales well, can go beyond the use of multiple cores by using multiple CPUs, and the interface is the same as using the Python threading module.
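
A hedged sketch of that approach (fetch is a stand-in for a real HTTP request, and the table is illustrative): worker processes fetch in parallel, while the parent process owns the sqlite3 connection and performs all the writes:

import sqlite3
from multiprocessing import Pool

def fetch(url):
    return url, f"payload for {url}"  # stand-in for a real web request

if __name__ == "__main__":
    urls = [f"http://example.com/{i}" for i in range(5)]
    conn = sqlite3.connect("mydatabase.db")
    conn.execute("CREATE TABLE IF NOT EXISTS results (url TEXT, body TEXT)")
    with Pool() as pool:
        # All writes happen here, in the parent process only.
        for url, body in pool.imap_unordered(fetch, urls):
            conn.execute("INSERT INTO results VALUES (?, ?)", (url, body))
    conn.commit()
    conn.close()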

Or, as Ali suggested, just use SQLAlchemy's thread pooling mechanism. It will handle everything for you automatically and has many extra features, just to quote some of them (a short usage sketch follows after the list):

  1. SQLAlchemy includes dialects for SQLite, Postgres, MySQL, Oracle, MS-SQL, Firebird, MaxDB, MS Access, Sybase and Informix; IBM has also released a DB2 driver. So you don't have to rewrite your application if you decide to move away from SQLite.
  2. The Unit of Work system, a central part of SQLAlchemy's Object Relational Mapper (ORM), organizes pending create/insert/update/delete operations into queues and flushes them all in one batch. To accomplish this it performs a topological "dependency sort" of all modified items in the queue so as to honor foreign key constraints, and groups redundant statements together where they can sometimes be batched even further. This produces the maximum efficiency and transaction safety, and minimizes chances of deadlocks.
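
A short sketch of what the pooled-engine approach might look like here, assuming SQLAlchemy 2.0-style usage and an illustrative results table; check_same_thread=False is passed through explicitly as a precaution:

import threading
from sqlalchemy import create_engine, text

# The engine manages a pool of connections; each thread checks one out.
engine = create_engine(
    "sqlite:///mydatabase.db",
    connect_args={"check_same_thread": False},
)
with engine.connect() as setup:
    setup.execute(text("CREATE TABLE IF NOT EXISTS results (value INTEGER)"))
    setup.commit()

def worker(n):
    with engine.connect() as conn:
        conn.execute(text("INSERT INTO results (value) VALUES (:v)"), {"v": n})
        conn.commit()

threads = [threading.Thread(target=worker, args=(i,)) for i in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()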

You shouldn't be using threads at all for this. This is a trivial task for twisted and that would likely take you significantly further anyway.

Use only one thread, and have the completion of the request trigger an event to do the write.

twisted will take care of the scheduling, callbacks, etc. for you. It'll hand you the entire result as a string, or you can run it through a stream-processor (I have a twitter API and a friendfeed API that both fire off events to callers as results are still being downloaded).

Depending on what you're doing with your data, you could just dump the full result into sqlite as it's complete, cook it and dump it, or cook it while it's being read and dump it at the end.
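
A rough skeleton of that single-threaded design, assuming Twisted's Agent API; the URL and table are illustrative, and all sqlite3 access stays in the reactor thread:

import sqlite3
from twisted.internet import reactor, task
from twisted.web.client import Agent, readBody

conn = sqlite3.connect("mydatabase.db")  # used only from the reactor thread
conn.execute("CREATE TABLE IF NOT EXISTS results (body BLOB)")
agent = Agent(reactor)

def store(body):
    conn.execute("INSERT INTO results (body) VALUES (?)", (body,))
    conn.commit()

def fetch():
    d = agent.request(b"GET", b"http://example.com/data")
    d.addCallback(readBody)   # collect the response body as bytes
    d.addCallback(store)      # the write fires in the reactor thread
    return d

loop = task.LoopingCall(fetch)
loop.start(1.0)  # fire once per second
reactor.run()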

I have a very simple application on github that does something close to what you're wanting. I call it pfetch (parallel fetch). It grabs various pages on a schedule, streams the results to a file, and optionally runs a script upon successful completion of each one. It also does some fancy stuff like conditional GETs, but still could be a good base for whatever you're doing.

"

You need to call session.close() after every transaction against the database, so that a cursor is only ever used within the thread that created it; using the same cursor across multiple threads is what causes this error.
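
A minimal sketch of that advice, assuming a SQLAlchemy sessionmaker and an existing results(value) table: each transaction gets a fresh session that is closed before the thread moves on.

from sqlalchemy import create_engine, text
from sqlalchemy.orm import sessionmaker

engine = create_engine("sqlite:///mydatabase.db")
Session = sessionmaker(bind=engine)

def save(value):
    session = Session()  # fresh session (and cursor) for this transaction
    try:
        session.execute(text("INSERT INTO results (value) VALUES (:v)"), {"v": value})
        session.commit()
    finally:
        session.close()  # release it so it never crosses threads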

"

I could not find any benchmarks in any of the above answers, so I wrote a test to benchmark everything.

I tried 3 approaches:

  1. Reading and writing sequentially from the SQLite database
  2. Using a ThreadPoolExecutor to read/write
  3. Using a ProcessPoolExecutor to read/write

The results and takeaways from the benchmark are as follows:

  1. Sequential reads/sequential writes work the best
  2. If you must process in parallel, use the ProcessPoolExecutor to read in parallel (see the sketch after this list)
  3. Do not perform any writes with either the ThreadPoolExecutor or the ProcessPoolExecutor, as you will run into database-locked errors and will have to retry inserting the chunk
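
A sketch of takeaway 2, with an illustrative results table keyed by id; each worker process opens its own connection, since sqlite3 connections cannot be shared across processes:

import sqlite3
from concurrent.futures import ProcessPoolExecutor

def read_chunk(id_range):
    lo, hi = id_range
    conn = sqlite3.connect("mydatabase.db")
    try:
        conn.execute("CREATE TABLE IF NOT EXISTS results (id INTEGER PRIMARY KEY)")
        rows = conn.execute(
            "SELECT * FROM results WHERE id BETWEEN ? AND ?", (lo, hi)
        ).fetchall()
        return len(rows)
    finally:
        conn.close()

if __name__ == "__main__":
    ranges = [(1, 1000), (1001, 2000), (2001, 3000)]
    with ProcessPoolExecutor() as pool:
        print(list(pool.map(read_chunk, ranges)))  # rows read per chunk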

You can find the code and complete solution for the benchmarks in my SO answer HERE. Hope that helps!

I would take a look at the y_serial Python module for data persistence: http://yserial.sourceforge.net

It handles deadlock issues surrounding a single SQLite database. If demand on concurrency gets heavy, one can easily set up the class Farm of many databases to diffuse the load over stochastic time.

Hope this helps your project... it should be simple enough to implement in 10 minutes.

I like Evgeny's answer - queues are generally the best way to implement inter-thread communication. For completeness, here are some other options:

  • Close the DB connection when the spawned threads have finished using it. This would fix your OperationalError, but opening and closing connections like this is generally a no-no, due to performance overhead.
  • Don't use child threads. If the once-per-second task is reasonably lightweight, you could get away with doing the fetch and store, then sleeping until the right moment (see the sketch after this list). This is undesirable as fetch and store operations could take >1 sec, and you lose the benefit of multiplexed resources you have with a multi-threaded approach.
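
A minimal sketch of that second option (the URL is a placeholder): fetch, store, then sleep out the remainder of the one-second interval:

import sqlite3
import time
import urllib.request

conn = sqlite3.connect("mydatabase.db")
conn.execute("CREATE TABLE IF NOT EXISTS results (fetched_at REAL, body BLOB)")

while True:
    started = time.time()
    body = urllib.request.urlopen("http://example.com/data").read()  # placeholder URL
    conn.execute("INSERT INTO results VALUES (?, ?)", (started, body))
    conn.commit()
    # Sleep whatever is left of the 1-second interval (skipped if we overran it).
    time.sleep(max(0.0, 1.0 - (time.time() - started)))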

You need to design the concurrency for your program. SQLite has clear limitations and you need to obey them; see the FAQ (also the following question).

"

Please consider checking the value of THREADSAFE in the pragma_compile_options of your SQLite installation. For instance, with

SELECT * FROM pragma_compile_options;

If THREADSAFE is equal to 1, then your SQLite installation is threadsafe, and all you have to do to avoid the threading exception is to create the Python connection with check_same_thread equal to False. In your case, it means

conn = sqlite3.connect("mydatabase.db", checksamethread=False)

That's explained in some detail in Python, SQLite, and thread safety.
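
For what it's worth, you can also read the compile-time options from Python itself (pragma_compile_options requires a reasonably recent SQLite, roughly 3.16+):

import sqlite3

conn = sqlite3.connect(":memory:")
opts = [row[0] for row in conn.execute("SELECT * FROM pragma_compile_options")]
print([o for o in opts if o.startswith("THREADSAFE")])  # e.g. ['THREADSAFE=1']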

The most likely reason you get errors with locked databases is that you must issue

conn.commit()
