Python sqlite3 and concurrency
I have a Python program that uses the "threading" module. Once every second, my program starts a new thread that fetches some data from the web and stores this data to my hard drive. I would like to use sqlite3 to store these results, but I can't get it to work. The issue seems to be with the following line:
conn = sqlite3.connect("mydatabase.db")
Previously I was storing all my results in CSV files and did not have any of these file-locking issues. Hopefully this will be possible with sqlite. Any ideas?
Contrary to popular belief, newer versions of sqlite3 do support access from multiple threads. This can be enabled via the optional keyword argument check_same_thread:

sqlite.connect(":memory:", check_same_thread=False)
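A minimal sketch of what that looks like in practice. check_same_thread=False only disables Python's thread-ownership check; the lock used here is an extra assumption to serialize writes safely:

```python
import sqlite3
import threading

# One shared connection used from several threads; a lock serializes writes.
conn = sqlite3.connect(":memory:", check_same_thread=False)
lock = threading.Lock()
conn.execute("CREATE TABLE results (url TEXT, body TEXT)")

def store(url, body):
    with lock:
        conn.execute("INSERT INTO results VALUES (?, ?)", (url, body))
        conn.commit()

threads = [threading.Thread(target=store, args=("http://example.com/%d" % i, "data"))
           for i in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()

count = conn.execute("SELECT COUNT(*) FROM results").fetchone()[0]
print(count)  # 5
```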
You can use the producer-consumer pattern. For example, you can create a queue that is shared between threads. The first thread, which fetches data from the web, enqueues this data in the shared queue. Another thread, which owns the database connection, dequeues data from the queue and passes it to the database.
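The pattern described above can be sketched like this (names and the sentinel are illustrative; the fetcher is a stand-in for the real web request):

```python
import queue
import sqlite3
import threading

q = queue.Queue()
STOP = object()  # sentinel telling the writer thread to shut down

def fetcher(i):
    # stand-in for the real web fetch
    q.put(("http://example.com/%d" % i, "payload"))

def writer():
    # this thread is the only one that touches the connection
    global rows
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE results (url TEXT, body TEXT)")
    while True:
        item = q.get()
        if item is STOP:
            break
        conn.execute("INSERT INTO results VALUES (?, ?)", item)
        conn.commit()
    rows = conn.execute("SELECT COUNT(*) FROM results").fetchone()[0]
    conn.close()

w = threading.Thread(target=writer)
w.start()
fetchers = [threading.Thread(target=fetcher, args=(i,)) for i in range(3)]
for t in fetchers:
    t.start()
for t in fetchers:
    t.join()
q.put(STOP)
w.join()
print(rows)  # 3
```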
The following was found on mail.python.org.pipermail.1239789:

I have found the solution. I don't know why the Python documentation has not a single word about this option. So we have to add a new keyword argument to the connection function, and we will be able to create cursors from it in a different thread. So use:

sqlite.connect(":memory:", check_same_thread=False)
Switch to multiprocessing. It is much better, scales well, can go beyond the use of multiple cores by using multiple CPUs, and the interface is the same as the Python threading module.
Or, as Ali suggested, just use SQLAlchemy's thread pooling mechanism. It will handle everything for you automatically and has many extra features.
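A sketch of the pooling approach, assuming SQLAlchemy is installed: StaticPool keeps a single shared connection (useful for in-memory databases), and check_same_thread=False is passed through to sqlite3 so pooled connections can cross thread boundaries.

```python
from sqlalchemy import create_engine, text
from sqlalchemy.pool import StaticPool

engine = create_engine(
    "sqlite://",                              # in-memory database
    connect_args={"check_same_thread": False},
    poolclass=StaticPool,                     # one shared connection
)

with engine.begin() as conn:                  # begin() commits on success
    conn.execute(text("CREATE TABLE results (url TEXT, body TEXT)"))
    conn.execute(text("INSERT INTO results VALUES ('http://example.com', 'data')"))

with engine.connect() as conn:
    n = conn.execute(text("SELECT COUNT(*) FROM results")).scalar()
print(n)  # 1
```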
You shouldn't be using threads at all for this. This is a trivial task for twisted, and that would likely take you significantly further anyway.
twisted will take care of the scheduling, callbacks, etc. for you. It'll hand you the entire result as a string, or you can run it through a stream processor (I have a twitter API and a friendfeed API that both fire off events to callers as results are still being downloaded).
I have a very simple application on github that does something close to what you're wanting. I call it pfetch (parallel fetch). It grabs various pages on a schedule, streams the results to a file, and optionally runs a script upon the successful completion of each one. It also does some fancy stuff like conditional GETs, but still could be a good base for whatever you're doing.
You need to use session.close() after every transaction with the database, so that the same cursor is used within the same thread rather than shared across multiple threads, which is what causes this error.
I could not find any benchmarks in any of the above answers, so I wrote a test to benchmark everything. I tried 3 approaches. The results and takeaways from the benchmark are as follows. You can find the code and complete solution for the benchmarks in my SO answer HERE. Hope that helps!
I would take a look at the y_serial Python module for data persistence: http://yserial.sourceforge.net which handles deadlock issues surrounding a single SQLite database. If demand on concurrency gets heavy, one can easily set up the class Farm of many databases to diffuse the load over stochastic time.

Hope this helps your project... it should be simple enough to implement in 10 minutes.
I like Evgeny's answer - queues are generally the best way to implement inter-thread communication. For completeness, here are some other options:

Open and close a connection for each transaction. This will fix your OperationalError, but opening and closing connections like this is generally a no-no, due to performance overhead.
Don't use child threads. If the once-per-second task is reasonably lightweight, you could get away with doing the fetch and store, then sleeping until the right moment. This is undesirable, as fetch and store operations could take >1 sec, and you lose the benefit of the multiplexed resources you have with a multi-threaded approach.
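The single-threaded alternative described above can be sketched as a fetch-store-sleep loop (fetch() is a stand-in for the real web request; the interval is shortened here for illustration):

```python
import sqlite3
import time

def fetch():
    # stand-in for the real web fetch
    return ("http://example.com", "payload")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (url TEXT, body TEXT)")

for _ in range(3):                         # the real program would loop forever
    start = time.monotonic()
    conn.execute("INSERT INTO results VALUES (?, ?)", fetch())
    conn.commit()
    elapsed = time.monotonic() - start
    time.sleep(max(0.0, 0.01 - elapsed))   # would be 1.0 in the real program

rows = conn.execute("SELECT COUNT(*) FROM results").fetchone()[0]
print(rows)  # 3
```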
You need to design the concurrency for your program. SQLite has clear limitations, and you need to obey them; see the FAQ (also the following question).
Please consider checking the value of THREADSAFE in the pragma_compile_options of your SQLite installation. For instance, with

SELECT * FROM pragma_compile_options;

If THREADSAFE is equal to 1, then your SQLite installation is threadsafe, and all you have to do to avoid the threading exception is to create the Python connection with check_same_thread equal to False. In your case, it means

conn = sqlite3.connect("mydatabase.db", check_same_thread=False)

That's explained in some detail in Python, SQLite, and thread safety
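That check can be done directly from Python; a short sketch (the exact options reported depend on how your SQLite library was built):

```python
import sqlite3

# Query the compile-time options of the linked SQLite library
# and pick out the THREADSAFE setting.
conn = sqlite3.connect(":memory:")
options = [row[0] for row in
           conn.execute("SELECT compile_options FROM pragma_compile_options")]
threadsafe = [opt for opt in options if opt.startswith("THREADSAFE")]
print(threadsafe)  # e.g. ['THREADSAFE=1'], depending on your build
```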
The most likely reason you get errors with locked databases is that you must issue

conn.commit()
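A minimal illustration of the point above: committing after each write releases the write lock so other connections are not blocked.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (url TEXT, body TEXT)")
conn.execute("INSERT INTO results VALUES ('http://example.com', 'payload')")
conn.commit()  # without this, the database stays write-locked

n = conn.execute("SELECT COUNT(*) FROM results").fetchone()[0]
print(n)  # 1
```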